This document is the summary of the R for Data Analysis workshop.

All correspondence related to this document should be addressed to:

Omid Ghasemi (Macquarie University, Sydney, NSW, 2109, AUSTRALIA)

Email:

1 Introduction to R

1.1 Basics and Variables

R can be used as a calculator. For mathematical purposes, be careful of the order in which R executes the commands.

10 + 10
## [1] 20
4 ^ 2
## [1] 16
(250 / 500) * 100
## [1] 50

R is a bit flexible with spacing (but no spacing in the name of variables and words)

10+10
## [1] 20
10                 +           10
## [1] 20

R can sometimes tell that you’re not finished yet

10 +

How to create a variable? Variable assignment using <- and =. Note that R is case sensitive for everything

pay <- 250

month = 12

pay * month
## [1] 3000
salary <- pay * month

Few points in naming variables and vectors: use short, informative words, keep same method (e.g., you can use capital letters but it is not recommended, use only _ or . ).

1.2 Function

Function is a set of statements combined together to perform a specific task. When we use a block of code repeatedly, we can convert it to a function. To write a function, first, you need to define it:

my_multiplier <- function(a,b){
  result = a * b
  return (result)
}

This code do nothing. To get a result, you need to call it:

my_multiplier (a=2, b=4)
## [1] 8
# or: my_multiplier (2, 4)

We can set a default value for our arguments:

my_multiplier2 <- function(a,b=4){
  result = a * b
  return (result)
}

my_multiplier2 (a=2)
## [1] 8
# or: my_multiplier (2)
# or: my_multiplier (2, 6)

Fortunately, you do not need to write everything from scratch. R has lots of built-in functions that you can use:

round(54.6787)
## [1] 55
round(54.5787, digits = 2)
## [1] 54.58

Use ? before the function name to get some help. For example, ?round. You will see many functions in the rest of the workshop.

1.3 Data Types

function class() is used to show what is the type of a variable.

  1. Logical: TRUE, FALSE can be abbreviated as T, F. They has to be capital, ‘true’ is not a logical data:
class(TRUE)
## [1] "logical"
class(F)
## [1] "logical"
  1. Numeric: all numbers e.g. 5, 10.5, 11,37; a special type of numeric is “integer” which is numbers without decimal. Integers are always numeric, but numeric is not always integer:
class(2)
## [1] "numeric"
class(13.46)
## [1] "numeric"
  1. Character: text for example, “I love R” or “4” or “4.5”:
class("ha ha ha ha")
## [1] "character"
class("56.6")
## [1] "character"
class("TRUE")
## [1] "character"

Can we change the type of data in a variable? Yes, you need to use the function as.---()

as.numeric(TRUE)
## [1] 1
as.character(4)
## [1] "4"
as.numeric("4.5")
## [1] 4.5
as.numeric("Hello")
## Warning: NAs introduced by coercion
## [1] NA

1.4 Data Structures

1.4.1 Vector

When there are more than one number or letter stored. Use the combine function c() for that.

sale <- c(1, 2, 3,4, 5, 6, 7, 8, 9, 10) # also sale <- c(1:10)

sale <- c(1:10)

sale * sale
##  [1]   1   4   9  16  25  36  49  64  81 100

Subsetting a vector:

days <- c("Saturday", "Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

days[2]
## [1] "Sunday"
days[-2]
## [1] "Saturday"  "Monday"    "Tuesday"   "Wednesday" "Thursday"  "Friday"
days[c(2, 3, 4)]
## [1] "Sunday"  "Monday"  "Tuesday"
  • Exercise: Create a vector named my_vector with numbers from 0 to 1000 in it and calculate mean, median, sd, min, max, and sum of that vector:
my_vector <- (0:1000)

mean(my_vector)
## [1] 500
median(my_vector)
## [1] 500
min(my_vector)
## [1] 0
range(my_vector)
## [1]    0 1000
class(my_vector)
## [1] "integer"
sum(my_vector)
## [1] 500500
sd(my_vector)
## [1] 289.1081

1.4.2 List

List allows you to gather a variety of objects under one name (that is, the name of the list) in an ordered way. These objects can be matrices, vectors, data frames, even other list.

my_list = list(sale, 1, 3, 4:7, "HELLO", "hello", FALSE)
my_list
## [[1]]
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## [[2]]
## [1] 1
## 
## [[3]]
## [1] 3
## 
## [[4]]
## [1] 4 5 6 7
## 
## [[5]]
## [1] "HELLO"
## 
## [[6]]
## [1] "hello"
## 
## [[7]]
## [1] FALSE

1.4.3 Factor

Factors store the vector along with the distinct values of the elements in the vector as labels. The labels are always character irrespective of whether it is numeric or character. For example, variable gender with “male” and “female” entries:

gender <- c("male", "male", "male", " female", "female", "female")
gender <- factor(gender)

R now treats gender as a nominal (categorical) variable: 1=female, 2=male internally (alphabetically).

summary(gender)
##  female  female    male 
##       1       2       3
  • Question: why when we ran the above function i.e. summary(), it showed three and not two levels of the data? Hint: run ‘gender’.
gender
## [1] male    male    male     female female  female 
## Levels:  female female male

So, be careful of spaces!

  • Exercise: Create a gender factor with 30 male and 40 females (Hint: use the rep() function):
gender <- c(rep("male",30), rep("female", 40))
gender <- factor(gender)
gender
##  [1] male   male   male   male   male   male   male   male   male   male  
## [11] male   male   male   male   male   male   male   male   male   male  
## [21] male   male   male   male   male   male   male   male   male   male  
## [31] female female female female female female female female female female
## [41] female female female female female female female female female female
## [51] female female female female female female female female female female
## [61] female female female female female female female female female female
## Levels: female male

There are two types of categorical variables: nominal and ordinal. How to create ordered factors (when the variable is nominal and values can be ordered)? We should add two additional arguments to the factor() function: ordered = TRUE, and levels = c("level1", "level2"). For example, we have a vector that shows participants’ education level.

edu<-c(3,2,3,4,1,2,2,3,4)

education<-factor(edu, ordered = TRUE)
levels(education) <- c("Primary school","high school","College","Uni graduated")
education
## [1] College        high school    College        Uni graduated  Primary school
## [6] high school    high school    College        Uni graduated 
## Levels: Primary school < high school < College < Uni graduated
  • Exercise: We have a factor with patient and control values. Here, the first level is control and the second level is patient. Change the order of levels, so patient would be the first level:
health_status <- factor(c(rep('patient',5),rep('control',5)))
health_status
##  [1] patient patient patient patient patient control control control control
## [10] control
## Levels: control patient
health_status_reordered <- factor(health_status, levels = c('patient','control'))
health_status_reordered
##  [1] patient patient patient patient patient control control control control
## [10] control
## Levels: patient control

Finally, can you relabel both levels to uppercase characters? (Hint: check ?factor)

health_status_relabeled <- factor(health_status, levels = c('patient','control'), labels = c('Patient','Control'))
health_status_relabeled
##  [1] Patient Patient Patient Patient Patient Control Control Control Control
## [10] Control
## Levels: Patient Control

1.4.4 Matrices

All columns in a matrix must have the same mode(numeric, character, etc.) and the same length. It can be created using a vector input to the matrix function.

my_matrix = matrix(c(1,2,3,4,5,6,7,8,9), nrow = 3, ncol = 3)

my_matrix
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

1.4.5 Data frames

Data frames can hold numeric, character or logical values. Within a column all elements have the same data type, but different columns can be of different data type. Let’s create a dataframe:

id <- 1:200
group <- c(rep("Psychotherapy", 100), rep("Medication", 100))
response <- c(rnorm(100, mean = 30, sd = 5),
             rnorm(100, mean = 25, sd = 5))

my_dataframe <-data.frame(Patient = id,
                          Treatment = group,
                          Response = response)

We also could have done the below

my_dataframe <-data.frame(Patient = c(1:200),
                          Treatment = c(rep("Psychotherapy", 100), rep("Medication", 100)),
                          Response = c(rnorm(100, mean = 30, sd = 5),
                                       rnorm(100, mean = 25, sd = 5)))

In large data sets, the function head() enables you to show the first observations of a data frames. Similarly, the function tail() prints out the last observations in your data set.

head(my_dataframe) 
tail(my_dataframe)
Patient Treatment Response
1 1 Psychotherapy 30.42391
2 2 Psychotherapy 31.39435
3 3 Psychotherapy 31.98380
4 4 Psychotherapy 30.26156
5 5 Psychotherapy 32.89160
6 6 Psychotherapy 30.58312
Patient Treatment Response
195 195 Medication 18.54472
196 196 Medication 25.31502
197 197 Medication 22.52473
198 198 Medication 22.01121
199 199 Medication 29.19932
200 200 Medication 30.09011

Similar to vectors and matrices, brackets [] are used to selects data from rows and columns in data.frames:

my_dataframe[35, 3]
## [1] 19.96716
  • Exercise: How can we get all columns, but only for the first 10 participants?
my_dataframe[1:10, ]
Patient Treatment Response
1 Psychotherapy 30.42391
2 Psychotherapy 31.39435
3 Psychotherapy 31.98380
4 Psychotherapy 30.26156
5 Psychotherapy 32.89160
6 Psychotherapy 30.58312
7 Psychotherapy 29.31386
8 Psychotherapy 31.86684
9 Psychotherapy 31.60194
10 Psychotherapy 28.17029

How to get only the Response column for all participants?

my_dataframe[ , 3]
##   [1] 30.42391 31.39435 31.98380 30.26156 32.89160 30.58313 29.31386 31.86684
##   [9] 31.60194 28.17029 40.70102 31.00756 25.84319 29.56527 32.88131 17.93231
##  [17] 28.13834 28.38068 29.21312 30.41091 24.42506 28.72800 32.28634 27.02297
##  [25] 28.65416 36.01442 35.72138 27.42288 31.63057 34.75769 28.23678 23.25628
##  [33] 29.95145 28.87311 19.96716 25.39401 46.97635 26.88618 28.89566 35.20675
##  [41] 26.74658 26.21272 30.33895 32.62540 21.64545 23.72600 32.02974 33.69672
##  [49] 37.39239 29.75127 34.12932 25.31246 30.94949 17.77188 24.35454 33.22516
##  [57] 19.23332 30.87546 33.20793 30.50553 20.04208 41.22914 29.10625 33.77548
##  [65] 29.33675 31.01131 32.29470 30.55776 31.00207 29.95526 30.04618 31.56003
##  [73] 32.44361 29.62436 20.27625 31.37769 21.42085 37.07180 30.83059 30.90844
##  [81] 27.85175 29.43705 26.01446 33.49888 24.77147 32.02480 30.51447 28.56813
##  [89] 30.74553 31.14742 36.53496 24.67334 30.76875 38.65474 23.12777 32.82991
##  [97] 22.45032 35.49847 29.66192 21.85724 23.21670 18.26624 27.45448 34.09305
## [105] 30.46837 26.77456 25.37303 39.98304 26.29323 30.46605 20.37679 26.87346
## [113] 20.11632 23.56563 33.03331 30.29873 26.43332 25.79077 24.43063 28.59376
## [121] 27.90288 25.23165 19.69473 28.89037 26.06537 19.33005 24.58180 26.43273
## [129] 31.75669 18.81263 13.07141 25.53940 34.57619 24.38474 18.68158 25.61714
## [137] 29.46947 25.14800 31.91151 25.29593 29.32774 28.34083 33.82708 31.55684
## [145] 28.46168 22.96394 24.18663 22.86743 27.04206 23.95114 21.81046 21.36594
## [153] 22.17787 23.33292 33.63888 32.96656 26.29168 26.07104 28.96076 25.24093
## [161] 18.95290 29.48124 19.74454 26.28502 27.13508 22.27233 27.23583 33.87589
## [169] 23.84769 33.74727 20.97366 15.34242 22.37948 19.52661 27.65811 29.67194
## [177] 30.54342 23.39429 26.72360 28.42296 25.03079 23.70069 27.67194 26.06394
## [185] 32.07253 27.28652 25.06557 22.95932 24.46923 36.44290 25.86616 18.43971
## [193] 34.58480 33.33262 18.54472 25.31502 22.52473 22.01121 29.19932 30.09011

Another easier way for selecting particular items is using their names that is more helpful than number of the rows in large data sets:

my_dataframe[ , "Response"]
# OR:
my_dataframe$Response

So far, we created dataframes using data.frame function from the base R. However, a better way to create dataframes is to use the tibble function from tidyverse (see here).

2 Data Cleaning

Now, suppose we ran an experiment with 141 depressed patients. Participants were randomly assigned into two treatment groups: CBT or Psychodynamic psychotherapy. We measured self-report depression scores at 5 different stages of treatment:

  • Stage 1: Before starting any treatment. It is our base stage (pre-test)
  • Stage 2: After 5 sessions of psychotherapy (post-test1)
  • Stage 3: After 10 sessions of psychotherapy (post-test2)
  • Stage 4: At the end of the treatment (post-test3)
  • Stage 5: Three months after the treatment (post-test4)

let’s read and check the uncleaned data. But, first thing first. let’s install and then load the tidyvese package. We also need some other packages:

# Install it
install.packages("tidyverse")

# And then load it
library(tidyverse)

# Load other packages that you have already installed
library(here)
library(janitor)
library(broom)
library(afex)
library(emmeans)
library(knitr)
library(kableExtra)
library(ggsci)
library(patchwork)
library(skimr)
# install.packages("devtools")
# devtools::install_github("easystats/correlation")
library("correlation")
options(scipen=999) # turn off scientific notations
options(contrasts = c('contr.sum','contr.poly')) # set the contrast sum globally 
options(knitr.kable.NA = '')
# read the raw data
raw_data <- read_csv(here("raw_data","raw_data_exp1.csv"))
head(raw_data)
progress subject response_id consent_form age gender stage1_cbt stage2_cbt stage3_cbt stage4_cbt stage5_cbt stage1_dynamic stage2_dynamic stage3_dynamic stage4_dynamic stage5_dynamic anxiety1 anxiety2 anxiety3 anxiety4 anxiety5 anxiety6 anxiety7 anxiety8 group sleep_quality life_satisfaction
100 subj1 R_1f298znjmVzcOjp I consent 18 Female 90 31 33 47 50 5 5 5 5 5 6 5 3 Psychodynamic 9 9
100 subj2 R_tL0A9P33Gi18I0N I consent 18 Male 78 46 46 11 13 6 6 6 5 6 6 5 6 CBT 9 10
100 subj3 R_1LNyJhCKxTAAMOW I consent 19 Female 68 51 24 41 24 6 5 4 5 5 6 5 6 CBT 10 8
100 subj4 R_3enxzUsEYgs5r1a I consent 27 Female 100 21 11 6 31 6 6 6 1 6 6 6 1 Psychodynamic 8 7
100 subj5 R_2Qzl2096a4KNE29 I consent 19 Male 30 28 16 6 6 6 6 5 5 6 6 6 6 CBT 11 11
100 subj6 R_esb71WOTQySjusF I consent 20 Female 79 1 57 46 57 6 6 6 5 6 6 6 6 Psychodynamic 10 10
  • Exercise: There is a dataset in the cleaned_data folder named unicef_u5mr.csv. Read the dataset using read_csv and here.
unicef_data <- read_csv(here("cleaned_data","unicef_u5mr.csv"))

In order to clean the data, we use tidyverse which is a collection of packages to work with data. One of the tidyverse packages that we use regularly is dplyr which includes several functions:

  • mutate() adds new variables or change existing ones.
  • select() pick variables (columns) based on their names.
  • filter() picks cases (rows) based on their values.
  • summarise() gives a single single summary of the data (e.g., mean, counts, etc.)
  • arrange() changes the ordering of the rows.
  • group_by() divides your dataframe into grouped dataframes and allow you to do each of the above operations (except for arrange) on every one of them separately.

2.1 Select

Pick subject, age, and gender columns:

selected_data <- select(raw_data, subject, age, gender)

2.2 Filter

Now, do the following tasks: pick all the male participants, pick all the male participants or those greater than 25 years old, and finally pick all male participants and those greater than 25 years old:

# filter all males
fil_male <- filter(raw_data, gender == "Male")
# filter males and older than 25
fil_male_and_g25 <- filter(raw_data, gender == "Male" & age > 25 )
# filter males or older than 25
fil_male_or_g25 <- filter(raw_data, gender == "Male" | age > 25 )

2.3 Arrange

Arrange (order) your dataframe based on the age, once in an ascending order (youngers first) and once based on descending order (olders first):

# order participants based on their age
arranged_data <- arrange(raw_data, age)
# order participants based on their age (descendeing)
arranged_descending <- arrange(raw_data, desc(age))

2.4 Mutate

Create a column to show if the participant has finished the task or not:

mutated_data <- mutate (raw_data, finished= case_when(progress==100~ "Yes",T~ "No"))

2.5 Summarise

Summarize participants age and sd:

summarise(raw_data, mean= mean(age, na.rm=T),
          sd= sd (age, na.rm=T))
mean sd
21.27273 6.635655

2.6 Pipe Operators

A new function: pipe operators %>% pipes a value into the next function:

raw_data %>% 
  summarise(., mean= mean(age, na.rm=T),
            sd= sd (age, na.rm=T))
mean sd
21.27273 6.635655
raw_data %>% 
  summarise(mean= mean(age, na.rm=T),
            sd= sd (age, na.rm=T))
mean sd
21.27273 6.635655

Calculate the age mean of younger than 25 participants only:

raw_data %>% 
  filter (age < 25) %>%
  summarise(mean= mean(age, na.rm=T),
            sd= sd (age, na.rm=T))
mean sd
19.1913 1.515393

2.7 Group By

Calculate the age mean of younger than 25 participants for each gender separately:

raw_data %>% 
  filter (age < 25) %>%
  group_by(gender) %>%
  summarise(mean= mean(age, na.rm=T),
            sd= sd (age, na.rm=T)) %>%
  ungroup ()
gender mean sd
Female 19.21053 1.556693
Male 19.10000 1.333772
  • Exercise: Create a column to show if participant is older than 23 or not and then calculate sleep quality (sleep_quality) mean for each group separately:
raw_data %>%
  mutate(age_group = case_when(age > 23 ~ "greater than 23", T~ "younger than 23")) %>%
  group_by(age_group) %>%
  summarise(sleep_quality = mean(sleep_quality, na.rm=T))
age_group sleep_quality
greater than 23 9.000000
younger than 23 8.107438
  • Exercise: Add the anxiety total score (sum) to the dataframe and then convert subject column to factor:
anxiety_data <- raw_data %>%
  mutate(anxiety_total= anxiety1+anxiety2+anxiety3+anxiety4+anxiety5+anxiety6+anxiety7+anxiety8) %>%
  mutate(subject= factor(subject))

2.8 Pivoting

Next, we want to pivot our data to switch between long and wide format:

# Make you data long
long_data <- raw_data %>%
  select(subject, stage1_cbt:stage5_cbt,stage1_dynamic:stage5_dynamic) %>%
  pivot_longer(cols = c(stage1_cbt:stage5_dynamic), names_to = 'stage', values_to = 'depression_score')

# Make you data wide
wide_data <- long_data %>%
  pivot_wider(names_from = stage, values_from= depression_score)
  • Exercise: Convert the UNICEF dataset to long and wide formats:
unicef_data <- read_csv(here("cleaned_data","unicef_u5mr.csv"))

library(janitor)
unicef_data_cleaned <- unicef_data %>%
  clean_names()

unicef_long_data <- unicef_data_cleaned %>% pivot_longer(cols = c(u5mr_1950:u5mr_2015), names_to = 'year', values_to = 'u5mr')
unicef_wideg_data <- unicef_long_data %>% pivot_wider(names_from = 'year', values_from = 'u5mr')

Note: The codes for the previous exercise were taken from this blog post written by Simon Ejdemyr.

Now, let’s do some cleaning using dplyr, tidyr and other tidyverse libraries:

cleaned_data <- raw_data %>% 
  filter(progress == 100) %>% # filter out unfinished participants
  select(-consent_form) %>% #remove some useless columns
  # create a total score for our questionnaire
  mutate(anxiety_total= anxiety1+anxiety2+anxiety3+anxiety4+anxiety5+anxiety6+anxiety7+anxiety8) %>%
  select(-anxiety1:-anxiety8) %>%
  # make our dataframe long
  pivot_longer(cols = c(stage1_cbt:stage5_cbt,stage1_dynamic:stage5_dynamic),names_to = 'stage',values_to = 'depression_score') %>% 
  #pivot_wider(names_from = stage, values_from= depression_score) # this code change our dataframe back to wide
  filter(!is.na(depression_score)) %>% #remove rows with depression_score == NA
  mutate(stage= gsub("_.*", "", stage)) %>%
  select (subject, age, gender, group, stage, depression_score, anxiety_total, sleep_quality, life_satisfaction)
subject age gender group stage depression_score anxiety_total sleep_quality life_satisfaction
subj1 18 Female Psychodynamic stage1 90 39 9 9
subj1 18 Female Psychodynamic stage2 31 39 9 9
subj1 18 Female Psychodynamic stage3 33 39 9 9
subj1 18 Female Psychodynamic stage4 47 39 9 9
subj1 18 Female Psychodynamic stage5 50 39 9 9
subj2 18 Male CBT stage1 78 46 9 10

Ok, now the data is clean and tidy which means:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table (Wickham, 2014).

Check the dataframe and all the data types:

str(cleaned_data)
## tibble [655 × 9] (S3: tbl_df/tbl/data.frame)
##  $ subject          : chr [1:655] "subj1" "subj1" "subj1" "subj1" ...
##  $ age              : num [1:655] 18 18 18 18 18 18 18 18 18 18 ...
##  $ gender           : chr [1:655] "Female" "Female" "Female" "Female" ...
##  $ group            : chr [1:655] "Psychodynamic" "Psychodynamic" "Psychodynamic" "Psychodynamic" ...
##  $ stage            : chr [1:655] "stage1" "stage2" "stage3" "stage4" ...
##  $ depression_score : num [1:655] 90 31 33 47 50 78 46 46 11 13 ...
##  $ anxiety_total    : num [1:655] 39 39 39 39 39 46 46 46 46 46 ...
##  $ sleep_quality    : num [1:655] 9 9 9 9 9 9 9 9 9 9 ...
##  $ life_satisfaction: num [1:655] 9 9 9 9 9 10 10 10 10 10 ...

Finally, we save our data to the cleaned_data folder.

write_csv(cleaned_data, here("cleaned_data","cleaned_data_exp1.csv"))

3 Data Visualization

Before starting the ggplot, let’s try a visualization using a function from the Base R the plot() function shows the association of each variable against the other one in a data handy for data with few number of variables to see if there are any patterns

exam_data<- read_csv(here::here("cleaned_data", "exam_data.csv"))

plot(x = exam_data$Anxiety, y = exam_data$Exam)

The code also works without writing x and y, however, writing them is strongly recommended

plot(exam_data$Anxiety, exam_data$Exam)

ggplot, the gg in ggplot stands for grammar of graphics. Grammar of graphics basically says any graphical representation of data, can be produced by a series of layers. You can think of a layer as a plastic transparency. Lets draw the same plot using ggplot. Always, mention the data we are going to work with.

ggplot(data = exam_data, aes(x = Exam, y = Anxiety))

  • aes: aes which stands for aesthetics is a relationship between a variable in your dataset and an aspect of the plot that is going to visually convey the information to the reader

  • Visual elements are known as geoms (short for ‘geometric objects’) in ggplot 2. When we define a layer, we have to tell R what geom we want displayed on that layer (do we want a bar, line dot, etc.?)

ggplot(data = exam_data, aes(x = Exam, y = Anxiety))+ geom_point()

So, lets try some of them here like shape and size. Be careful with the + sign, if you clink enter for the next part of the code, the + sign should not go to the next line

ggplot(data = exam_data, aes(x = Exam, y = Anxiety))+
  geom_point(size = 2, shape = 8)

The current plot is not very informative about the patterns for each gender.

ggplot(data = exam_data, aes(x = Exam, y = Anxiety, color = Gender))+
  geom_point(size = 2, shape = 10)

ggplot(data = exam_data, aes(x = Exam, y = Anxiety, color = Gender, shape = Gender))+
  geom_point(size = 2, shape = 10)

Question: why the above code doesn’t make any change?

ggplot(data = exam_data, aes(x = Exam, y = Anxiety, color = Gender, shape = Gender))+
  geom_point(size = 2)

Can assign the first layer to a variable to reduce the length of codes for next layers.

My_graph <- ggplot(data = exam_data, aes(x = Exam, y = Anxiety))

My_graph + geom_point()

lets add a line to the current graph

My_graph + geom_point() + geom_smooth()

Aesthetics can be set for all layers of the plot (i.e., defined in the plot as a whole) or can be set individually for each geom in a plot.

My_graph + geom_point(aes(color = Gender)) + geom_smooth()

My_graph + geom_point(aes(color = Gender)) + geom_smooth(aes(color = Gender))

The shaded area around the line is the 95% confidence interval around the line. We can switch this off by adding se = F (which is short for ‘standard error = False’)

My_graph + geom_point() + geom_smooth(se = F)

What if we want our line to be a direct line?

My_graph + geom_point() + geom_smooth(se = F, method = lm)

How to change the labels of x and y axes?

My_graph + geom_point() + geom_smooth(se = F, method = lm) +
  labs(x = "Exam scores %", y = "Anxiety scores")

Histograms are used to show distributions of variables while bar charts are used to compare variables. Histograms plot quantitative data with ranges of the data grouped into bins or intervals while bar charts plot categorical data.

#ggplot(data = exam_data, aes(x = Anxiety, y = Exam )) + geom_histogram()
# the code above gives an error as geom_histogram can only have x or y axis in its aes()

ggplot(data = exam_data, aes(x = Anxiety)) + geom_histogram()

ggplot(data = exam_data, aes(y = Anxiety)) + geom_histogram()

ggplot(data = exam_data, aes(x = Anxiety)) + geom_histogram(bins = 31)

ggplot(data = exam_data, aes(x = Anxiety)) + geom_histogram(bins = 31, fill = "green")

ggplot(data = exam_data, aes(x = Anxiety)) + geom_histogram(bins = 31, fill = "green", col = "red")

Let’s stop using the My_graph variable and write the whole code from the start again for a bar chart

ggplot(data = exam_data, aes(x = Sleep_quality))+
  geom_bar()

Because we want to plot a summary of the data (the mean) rather than the raw scores themselves, we have to use a stat.

ggplot(data = exam_data, aes(x = Sleep_quality, y = Exam, fill = Gender))+
  geom_bar(stat = "summary", fun = "mean")

ggplot(data = exam_data, aes(x = Sleep_quality, y = Exam, fill = Gender))+
  geom_bar(stat = "summary", fun = "mean", position = "dodge")

The other way to get the same plot that the code above gives, is using the stat_summary function that takes the following general form: stat_summary(function = x, geom = y)

ggplot(data = exam_data, aes(x = Sleep_quality, y = Exam, fill = Gender))+
  stat_summary(fun = mean, geom = "bar", position = "dodge")

How to combine multiple plots? How to combine multiple plots? We can use the patchwork package. A nice tutorial on using this package can be found here

p1 = My_graph + geom_point(aes(color = Gender)) + geom_smooth()

p2 = ggplot(data = exam_data, aes(x = Anxiety)) + geom_histogram(bins = 31)

p3 = ggplot(data = exam_data, aes(x = Sleep_quality, y = Exam, fill = Gender))+
  stat_summary(fun = mean, geom = "bar", position = "dodge")

p4 = My_graph + geom_point() + geom_smooth(se = F, method = lm) +
  labs(x = "Exam scores %", y = "Anxiety scores")

combined = p1 + p2+ p3 + p4 + plot_layout(nrow = 4, byrow = F)

combined

p1 | p2 / p3 / p4

p1 | p2 / (p3 / p4)

ggsave() function, which is a versatile exporting function that can export as PostScript (.eps/.ps), tex (pictex), pdf, jpeg, tiff, png, bmp, svg and wmf (in Windows only). In its basic form, the structure of the function is very simple: ggsave(filename)

ggsave(combined, filename = here("outputs", "combined.png"), dpi=300)

Now that we learned the basics of ggplot, let’s draw some plot for our experiment data. First, we need to create a dataset with aggregated depression_score scores over group and stage. We will use this dataset for line and bar graphs.

library(ggsci)

data_exp1_orig <- read_csv(here("cleaned_data","cleaned_data_exp1.csv"))

data_exp1 <- data_exp1_orig%>% 
  #mutate_if(is.character, factor) %>%
  mutate(subject= factor(subject), # convert all characters to factor
         group = factor(group),
         stage = factor(stage))


aggregated_data_exp1 <- data_exp1 %>%
  group_by(stage, group) %>%
  mutate(depression_score = mean(depression_score)) %>%
  ungroup()


barplot_exp1 <- aggregated_data_exp1 %>%
  ggplot(aes(x=stage, y= depression_score, fill=group)) +
  geom_bar(stat = "identity", position= "dodge")+
  labs (x= '', y= "Depression Score") + 
  theme_bw() + 
  scale_fill_jama() 

#ggsave(barplot_exp1, filename = here("outputs","barplot_exp1.png"), dpi=300)


barplot_facet_exp1 <- aggregated_data_exp1 %>%
  ggplot(aes(x=group, y= depression_score, fill=stage)) +
  geom_bar(stat = "identity", position= "dodge")+
  labs (x= '', y= "Depression Score") + 
  theme_bw() + 
  theme(legend.position = "none",
        axis.text=element_text(size=11),
        axis.title = element_text(size = 12)) +
  facet_wrap(~stage, nrow = 1)+
  scale_fill_jco() 

#ggsave(barplot_facet_exp1, filename = here("outputs","barplot_facet_exp1.png"), dpi=300)


lineplot_exp1 <- aggregated_data_exp1 %>%
  ggplot(aes(x=factor(stage), y= depression_score, group= group, color= group)) +
  geom_line(aes(linetype= group)) +
  geom_point(size= 5)+
  labs (x= '', y= "Depression Score") + 
  theme_classic() +
  theme(legend.position = "bottom",
        axis.text=element_text(size=11),
        axis.title = element_text(size = 12)) +
  scale_color_nejm() 

#ggsave(lineplot_exp1, filename = here("outputs","lineplot_exp1.png"), dpi=300)


violinplot_exp1 <- data_exp1 %>%
  ggplot(aes(x=factor(stage), y= depression_score, fill= group)) +
  geom_violin()+
  labs (x= '', y= "Depression Score") + 
  theme_bw() + 
  theme(legend.position = "bottom",
        axis.text=element_text(size=11)) +
  scale_fill_d3() 

#ggsave(violinplot_exp1, filename = here("outputs","violinplot_exp1.png"), dpi=300)


boxplot_exp1 <- data_exp1 %>%
  ggplot(aes(x=factor(stage), y= depression_score, fill= group)) +
  geom_boxplot()+
  #geom_point(position = position_dodge(width=0.75), alpha= .5)+
  labs (x= '', y= "Depression Score") + 
  theme_bw() + 
  theme(legend.position = "bottom",
        axis.text=element_text(size=11)) +
  scale_fill_simpsons() 

#ggsave(boxplot_exp1, filename = here("outputs","boxplot_exp1.png"), dpi=300)


boxplot_facet_exp1 <- data_exp1 %>%
  ggplot(aes(x=factor(stage), y= depression_score, fill= group)) +
  geom_boxplot()+
  labs (x= '', y= "Depression Score") + 
  theme_bw() + 
  theme(legend.position = "bottom",
        axis.text=element_text(size=11),
        axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  facet_wrap(~group)+
  scale_color_simpsons() 

#ggsave(boxplot_facet_exp1, filename = here("outputs","boxplot_facet_exp1.png"), dpi=300)

Let’s combine our plots:

combined_plot_exp1 <- barplot_facet_exp1 / (lineplot_exp1+violinplot_exp1+boxplot_exp1)
combined_plot_exp1

And here, we save our plots to the outputs folder.

ggsave(combined_plot_exp1, filename = here("outputs","combined_plot_exp1.png"), dpi=300, width = 12)

4 Descriptive Statistics

Now, let’s do some descriptive statistics. Now, we can open a new script called data_analysis.r and read some datasets. Then we use skimr package to describe our data.

narcissism_data <- read_csv(here("cleaned_data","narcissism_data.csv"))
narcissism_data %>% skimr::skim()
Data summary
Name Piped data
Number of rows 131
Number of columns 5
_______________________
Column type frequency:
character 1
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
subject 0 1 5 7 0 131 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
psychopathy 0 1 8.78 2.27 0 8.0 10 10 11 ▁▁▁▂▇
self_esteem 0 1 8.45 1.68 4 8.0 8 9 12 ▁▅▇▆▃
narcissism 0 1 38.20 6.15 19 33.5 39 43 48 ▁▂▇▇▆
mental_health 0 1 3.19 1.04 1 3.0 4 4 4 ▂▂▁▃▇
  • Exercise: Open the dataset called treatment_data.csv and do a descriptive analysis:
treatment_data <- read_csv(here("cleaned_data","treatment_data.csv"))
treatment_data %>% skimr::skim()
Data summary
Name Piped data
Number of rows 131
Number of columns 7
_______________________
Column type frequency:
character 3
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
subject 0 1 5 7 0 131 0
gender 0 1 4 6 0 2 0
treatment 0 1 3 13 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age 0 1 21.15 6.52 16 18.0 19 20.0 63 ▇▁▁▁▁
anxiety 0 1 62.35 24.51 0 40.0 69 81.0 100 ▂▆▃▇▆
depression 0 1 52.50 22.12 0 34.5 51 71.0 100 ▂▇▇▆▃
life_satisfaction 0 1 41.02 23.93 0 21.0 39 56.5 100 ▅▇▅▃▂
  • Exercise: Do the same thing for the memory_data.csv.
memory_data <- read_csv(here("cleaned_data","memory_data.csv"))
memory_data %>% group_by(time) %>%
  skimr::skim()
Data summary
Name Piped data
Number of rows 262
Number of columns 5
_______________________
Column type frequency:
character 2
numeric 2
________________________
Group variables time

Variable type: character

skim_variable time n_missing complete_rate min max empty n_unique whitespace
subject post_test_memory 0 1 5 7 0 131 0
subject pre_test_memory 0 1 5 7 0 131 0
gender post_test_memory 0 1 4 6 0 2 0
gender pre_test_memory 0 1 4 6 0 2 0

Variable type: numeric

skim_variable time n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age post_test_memory 0 1 21.15 6.52 16 18.0 19 20.0 63 ▇▁▁▁▁
age pre_test_memory 0 1 21.15 6.52 16 18.0 19 20.0 63 ▇▁▁▁▁
memory_score post_test_memory 0 1 52.50 22.12 0 34.5 51 71.0 100 ▂▇▇▆▃
memory_score pre_test_memory 0 1 41.02 23.93 0 21.0 39 56.5 100 ▅▇▅▃▂

Now, let’s describe our experiment data:

data_exp1_orig <- read_csv(here("cleaned_data","cleaned_data_exp1.csv"))

How many participants in total?

data_exp1 %>% summarise(n= n_distinct(subject))
n
131

How many participants do we have in each group?

data_exp1 %>% 
  group_by(subject) %>% 
  filter(row_number()==1) %>% 
  ungroup () %>% 
  group_by(group) %>% 
  count() 
group n
CBT 66
Psychodynamic 65

Find the mean and sd for numeric variables using base R summary function:

data_exp1 %>% 
  group_by(subject) %>% 
  filter(row_number()==1) %>% 
  ungroup () %>%
  summary()
##     subject         age           gender                    group   
##  subj1  :  1   Min.   :16.00   Length:131         CBT          :66  
##  subj10 :  1   1st Qu.:18.00   Class :character   Psychodynamic:65  
##  subj101:  1   Median :19.00   Mode  :character                     
##  subj102:  1   Mean   :21.15                                        
##  subj103:  1   3rd Qu.:20.00                                        
##  subj104:  1   Max.   :63.00                                        
##  (Other):125                                                        
##     stage     depression_score anxiety_total  sleep_quality   
##  stage1:131   Min.   :  1.00   Min.   :19.0   Min.   : 0.000  
##  stage2:  0   1st Qu.: 29.00   1st Qu.:33.5   1st Qu.: 8.000  
##  stage3:  0   Median : 51.00   Median :39.0   Median :10.000  
##  stage4:  0   Mean   : 53.33   Mean   :38.2   Mean   : 8.779  
##  stage5:  0   3rd Qu.: 79.00   3rd Qu.:43.0   3rd Qu.:10.000  
##               Max.   :101.00   Max.   :48.0   Max.   :11.000  
##                                                               
##  life_satisfaction
##  Min.   : 4.00    
##  1st Qu.: 8.00    
##  Median : 8.00    
##  Mean   : 8.45    
##  3rd Qu.: 9.00    
##  Max.   :12.00    
## 

Alternatively, we can use skimr library:

data_exp1 %>% 
  group_by(subject) %>% 
  filter(row_number()==1) %>% 
  ungroup () %>% 
  dplyr::select (age, depression_score, anxiety_total, sleep_quality, life_satisfaction) %>% 
  skimr::skim()
skim_type skim_variable n_missing complete_rate numeric.mean numeric.sd numeric.p0 numeric.p25 numeric.p50 numeric.p75 numeric.p100 numeric.hist
numeric age 0 1 21.152672 6.515630 16 18.0 19 20 63 ▇▁▁▁▁
numeric depression_score 0 1 53.328244 27.967685 1 29.0 51 79 101 ▅▇▅▇▅
numeric anxiety_total 0 1 38.198473 6.153698 19 33.5 39 43 48 ▁▂▇▇▆
numeric sleep_quality 0 1 8.778626 2.274576 0 8.0 10 10 11 ▁▁▁▂▇
numeric life_satisfaction 0 1 8.450382 1.683466 4 8.0 8 9 12 ▁▅▇▆▃
  • Exercise: For this exercise, we use a dataset of one of my own studies. In this study, we asked participants to guess the physical brightness of reasoning arguments and then we gave a cognitive ability test. (See the original study here). Open ghasemi_brightness_exp4.csv file and answer to the following questions:
  1. How many participants did we test in total?
  2. Find out how many male and female we tested.
  3. Calculate mean and sd for age and cognitive ability (cog_ability).
ghasemi_data <- read_csv(here("cleaned_data","ghasemi_brightness_exp4.csv"))

ghasemi_data %>% summarise(n = n_distinct(participant)) # number of participants:200
n
200
ghasemi_data %>% group_by (participant) %>% filter (row_number()==1) %>% group_by (gender) %>% summarise(n= n()) %>% ungroup() # 183 female, 17 male
gender n
Female 183
Male 17
ghasemi_data %>% dplyr::select (age, cog_ability) %>% skimr::skim() # mean and sd for age and cognitive ability
Data summary
Name Piped data
Number of rows 38400
Number of columns 2
_______________________
Column type frequency:
numeric 2
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age 0 1 22.20 6.78 17 19 20 22 52 ▇▁▁▁▁
cog_ability 0 1 39.55 9.46 11 34 40 46 61 ▁▃▇▆▂

5 Data Analysis

5.1 t-test

Now, we use the treatment data to run three different independent t-tests. Suppose we did an experiment to compare the effectiveness of CBT vs. Psychodynamic therapies in decreasing anxiety, and depression and also in improving life satisfaction:

# t.test (indep)
t.test(anxiety~treatment, data= treatment_data)
## 
##  Welch Two Sample t-test
## 
## data:  anxiety by treatment
## t = -0.85021, df = 124.18, p-value = 0.3968
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -12.11096   4.83264
## sample estimates:
##           mean in group CBT mean in group Psychodynamic 
##                    60.54545                    64.18462
t.test(depression~treatment, data= treatment_data)
## 
##  Welch Two Sample t-test
## 
## data:  depression by treatment
## t = -2.8725, df = 123.97, p-value = 0.004792
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -18.21965  -3.35424
## sample estimates:
##           mean in group CBT mean in group Psychodynamic 
##                    47.15152                    57.93846
t.test(life_satisfaction~treatment, data= treatment_data)
## 
##  Welch Two Sample t-test
## 
## data:  life_satisfaction by treatment
## t = -5.2688, df = 127.11, p-value = 0.0000005699
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -27.61850 -12.53721
## sample estimates:
##           mean in group CBT mean in group Psychodynamic 
##                    31.06061                    51.13846

In another experiment, suppose we have created a method to boost memory. Then, we recruit some participants, do a memory pre-test, implement the method, and do a memory post-test, Now, we want to see whether our method have improved participants’ memory:

# t.test (paired)
t.test(memory_score~time, data= memory_data, paired= T)
## 
##  Paired t-test
## 
## data:  memory_score by time
## t = 5.4761, df = 130, p-value = 0.0000002163
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   7.333171 15.628661
## sample estimates:
## mean of the differences 
##                11.48092

Now that we learned about t-test, let’s perform this test on our dataset. Is there a difference between groups at the first stage? Ideally, we want participants’ depresion score at the first stage to be similar for both groups because we have not started our treatment yet. Previous graphs showed us that depression scores of the CBT and Psychodynamic groups at this stage are pretty close. Let’s test that using an independent t-test (because we have 2 independent groups):

# Is there a difference between groups at the first stage?
data_exp1 %>% 
  group_by(group) %>% 
  filter(stage=='stage1') %>% 
  ungroup () %>%
  t.test(depression_score~group, data = ., paired=FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  depression_score by group
## t = 0.10768, df = 118.92, p-value = 0.9144
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -9.205588 10.264329
## sample estimates:
##           mean in group CBT mean in group Psychodynamic 
##                    53.59091                    53.06154

Now, we wonder if psychotherapy treatments were effective at all, regardless of the treatment method. So, we would like to test if depresion score at the forth stage are lower than scores at the stage 2? Since a pair of score at stage 2 and stage 4 is coming from a same person, we use paired t-test.

# Is there a difference between ratings of stage2 and stage4?
data_exp1 %>% 
  filter(stage=='stage2' | stage=='stage4') %>% 
  ungroup () %>%
  t.test(depression_score~stage, data = ., paired=TRUE)
## 
##  Paired t-test
## 
## data:  depression_score by stage
## t = 5.5931, df = 130, p-value = 0.0000001261
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   7.70108 16.13098
## sample estimates:
## mean of the differences 
##                11.91603
  • Exercise: John et al. (2019) investigated the consequences of backing down (changing one’s mind in lights of evidence)and how other people view someone who change their mind. In their second experiments, they presented participants either with a person who changes their mind or a person who refuses to back down. Then, they asked participants to rate how intelligent and confident the person is (See the original study here). They reported that:

“Relative to the entrepreneur who did not back down, participants judged the entrepreneur who backed down as more intelligent (M_backed_down=5.13 out of 7, SD=1.09; M_did_not_back_down=3.97, SD=1.54; t(271.12)=−7.59, p < .001) but less confident (M_backed_down=4.50 out of 7, SD=1.36; M_did_not_back_down=5.65, SD=1.10; t(291.01)=8.08, p < .001).”.

Open the john_backdown_exp2.csv file and try to reproduce their results. Run two separate independent t-test, one with intelligent as the dependent variable and one with confident as the dependent variable. For both t-test, use back_down as the between-subject independent variable.

john_data <- read_csv(here("cleaned_data","john_backdown_exp2.csv"))


t.test(intelligent~back_down, data = john_data, paired=FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  intelligent by back_down
## t = 7.5853, df = 271.12, p-value = 0.0000000000005319
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.8577107 1.4590076
## sample estimates:
##       mean in group backed_down mean in group did_not_back_down 
##                        5.129412                        3.971053
t.test(confident~back_down, data = john_data, paired=FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  confident by back_down
## t = -8.0763, df = 291.01, p-value = 0.00000000000001787
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.4257768 -0.8670294
## sample estimates:
##       mean in group backed_down mean in group did_not_back_down 
##                        4.503268                        5.649671

5.2 Analysis of Variance (ANOVA)

Now, let’s analysis our main experiment data: Do participants in the CBT group show better outcome compared to participants in the Psychodynamic group? Suppose we believe that participants should show lower depression after 5 or 10 sessions of both psychotherapy treatments and this decrease should be more pronounced for CBT than psychodynamic psychotherapy. If this is the case. we expect an interaction in the traditional Analysis of Variance (AONVA) test.

aov_m1 <- aov_car (depression_score ~ group*stage +
                     Error(subject/stage), data = data_exp1)  
Effect df MSE F ges p.value
group 1, 129 737.60 27.08 *** .066 <.001
stage 2.97, 382.72 492.81 53.15 *** .215 <.001
group:stage 2.97, 382.72 492.81 8.91 *** .044 <.001

As you can see, we found a significant main effect of stage and a significant group by stage interaction. We can use the emmeans package to do post-hoc tests.

# main effect of stage
emmeans(aov_m1, 'stage')
##  stage  emmean   SE  df lower.CL upper.CL
##  stage1   53.3 1.83 579     49.7     56.9
##  stage2   33.3 1.83 579     29.7     36.9
##  stage3   26.3 1.83 579     22.7     29.9
##  stage4   21.4 1.83 579     17.8     25.0
##  stage5   31.4 1.83 579     27.8     35.0
## 
## Results are averaged over the levels of: group 
## Warning: EMMs are biased unless design is perfectly balanced 
## Confidence level used: 0.95
pairs(emmeans(aov_m1, 'stage'), adjust= 'holm')
##  contrast        estimate   SE  df t.ratio p.value
##  stage1 - stage2    20.03 2.36 516  8.480  <.0001 
##  stage1 - stage3    26.94 2.36 516 11.404  <.0001 
##  stage1 - stage4    31.91 2.36 516 13.506  <.0001 
##  stage1 - stage5    21.84 2.36 516  9.245  <.0001 
##  stage2 - stage3     6.91 2.36 516  2.924  0.0144 
##  stage2 - stage4    11.87 2.36 516  5.027  <.0001 
##  stage2 - stage5     1.81 2.36 516  0.766  0.4442 
##  stage3 - stage4     4.97 2.36 516  2.102  0.0941 
##  stage3 - stage5    -5.10 2.36 516 -2.158  0.0941 
##  stage4 - stage5   -10.07 2.36 516 -4.261  0.0001 
## 
## Results are averaged over the levels of: group 
## P value adjustment: holm method for 10 tests
# group by stage interaction
emmeans(aov_m1, "group", by= "stage")
## stage = stage1:
##  group         emmean   SE  df lower.CL upper.CL
##  CBT             53.5 2.59 577    48.47     58.6
##  Psychodynamic   53.0 2.60 580    47.92     58.1
## 
## stage = stage2:
##  group         emmean   SE  df lower.CL upper.CL
##  CBT             30.7 2.59 577    25.58     35.7
##  Psychodynamic   35.9 2.60 580    30.75     41.0
## 
## stage = stage3:
##  group         emmean   SE  df lower.CL upper.CL
##  CBT             21.7 2.59 577    16.62     26.8
##  Psychodynamic   31.0 2.60 580    25.89     36.1
## 
## stage = stage4:
##  group         emmean   SE  df lower.CL upper.CL
##  CBT             13.4 2.59 577     8.29     18.4
##  Psychodynamic   29.4 2.60 580    24.29     34.5
## 
## stage = stage5:
##  group         emmean   SE  df lower.CL upper.CL
##  CBT             18.8 2.59 577    13.74     23.9
##  Psychodynamic   44.1 2.60 580    38.96     49.2
## 
## Warning: EMMs are biased unless design is perfectly balanced 
## Confidence level used: 0.95
update(pairs(emmeans(aov_m1, "group", by= "stage")), by = NULL, adjust = "holm") 
##  contrast            stage  estimate   SE  df t.ratio p.value
##  CBT - Psychodynamic stage1    0.529 3.67 579  0.144  0.8852 
##  CBT - Psychodynamic stage2   -5.195 3.67 579 -1.417  0.3138 
##  CBT - Psychodynamic stage3   -9.288 3.67 579 -2.534  0.0346 
##  CBT - Psychodynamic stage4  -16.022 3.67 579 -4.371  0.0001 
##  CBT - Psychodynamic stage5  -25.244 3.67 579 -6.887  <.0001 
## 
## P value adjustment: holm method for 5 tests

You can use the afex_plot function from afex to create beautiful plots. Those plots interacts nicely with ggplot:

afex_plot(aov_m1, x = "stage", trace = "group", error='between',
          line_arg = list(size=1),
          point_arg = list(size=3.5),
          data_arg = list(size= 1, color= 'grey', width=.4),
          data_geom = geom_boxplot,
          mapping = c("linetype", "shape", "fill"),
          legend_title = "Group") +
  labs(y = "Depression Score", x = "") +
  theme_bw()+ # remove the grey background and grid
  theme(axis.text=element_text(size=13),
        axis.title = element_text(size = 13),
        legend.text=element_text(size=13),
        legend.title=element_text(size=13),
        legend.position='bottom',
        legend.key.size = unit(1, "cm"),
        legend.background = element_rect(colour = 'black', fill = 'white', linetype='solid'))+
  scale_color_simpsons() +
  scale_fill_simpsons()

If you are interested in this topic, check out this nice tutorial about using afex to run ANOVA, and also this interesting tutorial on the emmeans package.

  • Exercise: Rotello et al. (2018) investigated the association between the race (White vs. Black faces) and the gun-tool judgments. In their first experiments, they presented participants with 16 White male faces and 16 Black male faces, and following that 8 images of guns and 8 images of tools. They asked participants to judge if the object is a tool or a gun by pressing keyboard buttons. Then, they ran an ANOVA to see if participants’ gun responses are higher for any of the races. So, they included prime race (Black, White) and target identity (gun, tool) as independent variables and participants’ gun responses as dependent variable into their linear model (See the original study here). They found that:

“Participants made more gun responses to guns than to tools, F(1,45) = 53243, p < 0.0001, η2g = 0.998. However, the race of the prime face did not matter, F(1,45) = 0.287, p > 0.59, η2g = 0.001, nor was there an interaction of prime race with target object, F(1,45) = 0.022, p > 0.88, η2g = 0.000)”.

Open the rotello_shooter_exp1.csv file and try to reproduce their results. Run an ANOVA (type III) with resp as the dependent variable and target, prime, and their interaction as independent variables.

# load the general data file
rotello_data <- read_csv(here("cleaned_data","rotello_shooter_exp1.csv"))

# ANOVA
rotello_aov <- aov_car (resp ~ target*prime +
           Error(subject/target*prime), data = rotello_data)
Effect df MSE F ges p.value
target 1, 45 0.00 53242.99 *** .998 <.001
prime 1, 45 0.00 0.29 .001 .595
target:prime 1, 45 0.00 0.02 <.001 .883

5.3 Correlation

Here, we want to check the correlation between variables on the narcissism_data. First, we need to remove subject column because it is not numeric:

narcissism_data_cor <- narcissism_data %>%
  select(-subject)
#-- Base R:
cor(narcissism_data_cor, method = "pearson",  use = "complete.obs")

#-- Psych library:
psych::pairs.panels(narcissism_data_cor, method = "pearson", hist.col = "#00AFBB", density = T, ellipses = F, stars = T)

#-- Correlation library:
# install.packages("devtools")
# devtools::install_github("easystats/correlation")
#library("correlation")
correlation::correlation(narcissism_data_cor) %>% summary()

#-- apaTables library:
narcissism_data_cor %>% 
  apaTables::apa.cor.table(filename="./outputs/CorMatrix.doc", show.conf.interval=T)
psychopathy self_esteem narcissism mental_health
psychopathy 1.00 0.15 0.40 -0.44
self_esteem 0.15 1.00 0.11 -0.29
narcissism 0.40 0.11 1.00 -0.26
mental_health -0.44 -0.29 -0.26 1.00
Parameter mental_health narcissism self_esteem
psychopathy -0.44 0.40 0.15
self_esteem -0.29 0.11
narcissism -0.26
  • Exercise: Pennycook et al. (2020) investigated the relationship between actively open-minded thinking style about evidence (AOT-E) and different political, scientific, and religious beliefs (see the original paper here). In their first experiment, they calculated the correlation of AOTE and scientific beliefs items (global warming, evolution, etc.) and they found the following results:

Open the pennycook_aote_exp1.csv file and try to reproduce their results by creating the same correlation matrix.

pennycook_data <- read_csv(here("cleaned_data","pennycook_aote_exp1.csv")) 


#---------- Base R:
cor(pennycook_data, method = "pearson",  use = "complete.obs")

#---------- Psych library:
pennycook_data %>% 
  psych::pairs.panels(method = "pearson", hist.col = "#00AFBB", density = T, ellipses = F, stars = T)

#---------- Correlation library:
correlation::correlation(pennycook_data) %>% summary()

#---------- apaTables library:
pennycook_data %>% 
  apaTables::apa.cor.table(filename="./outputs/CorMatrix.doc", show.conf.interval=T)
Parameter trust_scien gm_health tech_problems modern_medicine old_earth vaccines stem_cell big_bang evolution global_warming
aote 0.35 0.36 0.44 0.33 0.40 0.47 0.45 0.51 0.51 0.37
global_warming 0.42 0.06 0.14 0.18 0.33 0.26 0.31 0.33 0.38
evolution 0.48 0.33 0.28 0.36 0.47 0.39 0.54 0.78
big_bang 0.49 0.37 0.28 0.36 0.45 0.37 0.54
stem_cell 0.47 0.34 0.36 0.47 0.40 0.40
vaccines 0.43 0.52 0.49 0.53 0.38
old_earth 0.29 0.24 0.21 0.33
modern_medicine 0.43 0.42 0.47
tech_problems 0.33 0.39
gm_health 0.31

5.4 Linear Regression

Here, we do single and multiple linear regreassion on the narcissism_data:

m1 <- lm(mental_health~narcissism, data= narcissism_data)
term estimate std.error statistic p.value
(Intercept) 4.86 0.56 8.75 0
narcissism -0.04 0.01 -3.04 0
m2 <- lm(mental_health~narcissism+psychopathy, data= narcissism_data)
term estimate std.error statistic p.value
(Intercept) 5.43 0.53 10.27 0.00
narcissism -0.02 0.01 -1.09 0.28
psychopathy -0.19 0.04 -4.71 0.00
  • Exercise: Trémolière and Djeriouat (2020) examined the role of cognitive reflection and belief in science in climate change skepticism. In their first study, they revealed that cognitive reflection and belief in science negetively predicted climate change skepticism even after controlling for demographic and cognitive ability variables (see the original paper here).

Open the tremoliere_data_exp1.csv file and try to reproduce their results by running a multiple linear regression. Enter age, gender, education, belief in science, literacy, numeracy (Numtotal), and cognitive reflection as predictors and enter climate change skepticism (climato) as the outcome variable.

Tremoliere_data <- read_csv(here("cleaned_data","tremoliere_data_exp1.csv"))

Tremoliere_reg=lm(Climato ~ Age+ Gender+ Education+ BeliefInSciencetotal+ Literacy+ Numtotal+ CognitiveReflection,
                    data=Tremoliere_data)
term estimate std.error statistic p.value
(Intercept) 57.57 5.19 11.09 0.00
Age 0.01 0.05 0.24 0.81
Gender -5.68 1.34 -4.23 0.00
Education 0.54 0.38 1.43 0.15
BeliefInSciencetotal -0.20 0.06 -3.62 0.00
Literacy -0.49 0.51 -0.96 0.34
Numtotal -1.52 0.83 -1.82 0.07
CognitiveReflection -18.58 4.26 -4.37 0.00
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.19 0.17 12.65 11.91 0 7 -1467.77 2953.54 2988.81 58235.89 364 372

6 Rmarkdown

To be completed…

7 References

  • Ghasemi, O., Handley, S., & Howarth, S. (2020). The Bright Homunculus in our Head: Individual Differences in Intuitive Sensitivity to Logical Validity.

  • John, L. K., Jeong, M., Gino, F., & Huang, L. (2019). The self-presentational consequences of upholding one’s stance in spite of the evidence. Organizational Behavior and Human Decision Processes, 154, 1-14.

  • Pennycook, G., Cheyne, J. A., Koehler, D. J., & Fugelsang, J. A. (2020). On the belief that beliefs should change according to evidence: Implications for conspiratorial, moral, paranormal, political, religious, and science beliefs. Judgment and Decision Making, 15(4), 476.

  • Rotello, C. M., Kelly, L. J., Heit, E., Vazire, S., & Vul, E. (2018). The Shape of ROC Curves in Shooter Tasks: Implications for Best Practices in Analysis. Collabra: Psychology, 4(1).

  • Trémolière, B., & Djeriouat, H. (2020). Don’t you see that its cold! Exploring the roles of cognitive reflection, climate science literacy, illusion of knowledge, and political orientation in climate change skepticism.

  • Wickham, H. (2014). Tidy data. Journal of Statistical Software, 59(10), 1-23.

---
title: "R for Data Analysis"
author:
  - name: "Omid Ghasemi"
    affiliation: Macquarie University
    email: omidreza.ghasemi@hdr.mq.edu.au
  - name: "Mahdi Mazidi"
    affiliation: University of Western Australia
    email: mahdi.mazidisharafabadi@research.uwa.edu.au
date: "`r format(Sys.time(), '%d %B, %Y')`"
output: 
  html_document:
    keep_md: yes
    number_sections: true
    theme: cerulean
    code_download: true
    #code_folding: hide
    toc: true
    toc_float: true
    df_print: "kable"
---

This document is the summary of the **R for Data Analysis** workshop. 

All correspondence related to this document should be addressed to: 

<center>
Omid Ghasemi (Macquarie University, Sydney, NSW, 2109, AUSTRALIA) 

Email: omidreza.ghasemi@hdr.mq.edu.au 
</center>



<style>

body{ /* Normal  */
      text-align: justify;
      font-family: "Times New Roman", Times, serif;
}
code.r{ /* Code block */
    font-size: 14px;
}
pre { /* Code block - determines code spacing between lines */
    font-size: 12px;
}

</style>


```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(fig.align="center")
```



```{r libraries, message=FALSE, echo=F}
# load libraries
library(tidyverse)
library(here)
library(janitor)
library(broom)
library(afex)
library(emmeans)
library(knitr)
library(kableExtra)
library(ggsci)
library(patchwork)
library(skimr)
# install.packages("devtools")
# devtools::install_github("easystats/correlation")
library("correlation")
options(scipen=999) # turn off scientific notations
options(contrasts = c('contr.sum','contr.poly')) # set the contrast sum globally 
options(knitr.kable.NA = '')
```


# Introduction to R

## Basics and Variables

```{r echo=FALSE, out.width="700px", out.height="700px", fig.cap= "Artwork by Allison Horst: https://github.com/allisonhorst/stats-illustrations"}
knitr::include_graphics(here('inputs','r_first_then.png'))
```


R can be used as a calculator. For mathematical purposes, be careful of the order in which R executes the commands.

```{r}
10 + 10

4 ^ 2

(250 / 500) * 100
```

R is a bit flexible with spacing (but no spacing in the name of variables and words)

```{r}
10+10

10                 +           10
```

R can sometimes tell that you're not finished yet

```{r eval=F}
10 +
```

How to create a *variable*? Variable assignment using `<-` and `=`. Note that R is case sensitive for everything

```{r}
pay <- 250

month = 12

pay * month

salary <- pay * month
```


Few points in naming variables and vectors: use short, informative words, keep same method (e.g., you can use capital letters but it is not recommended, use only _ or . ).

## Function 
Function is a set of statements combined together to perform a specific task. When we use a block of code repeatedly, we can convert it to a function. To write a function, first, you need to *define* it:

```{r}
my_multiplier <- function(a,b){
  result = a * b
  return (result)
}
```

This code do nothing. To get a result, you need to *call* it:

```{r}
my_multiplier (a=2, b=4)
# or: my_multiplier (2, 4)
```

We can set a default value for our arguments:

```{r}
my_multiplier2 <- function(a,b=4){
  result = a * b
  return (result)
}

my_multiplier2 (a=2)
# or: my_multiplier (2)
# or: my_multiplier (2, 6)
```

Fortunately, you do not need to write everything from scratch. R has lots of built-in functions that you can use:
```{r}
round(54.6787)
round(54.5787, digits = 2)
```

Use `?` before the function name to get some help. For example, `?round`. You will see many functions in the rest of the workshop.

## Data Types

function `class()` is used to show what is the type of a variable.


1. *Logical*: `TRUE`, `FALSE` can be abbreviated as `T`, `F`.  They has to be capital, 'true' is not a logical data:
```{r}
class(TRUE)
class(F)
```

2. *Numeric*: all numbers e.g. 5,  10.5,  11,37;  a special type of numeric is "integer" which is numbers without decimal. Integers are always numeric, but numeric is not always integer:
```{r}
class(2)
class(13.46)
```

3. *Character*: text for example, "I love R" or "4" or "4.5":
```{r}
class("ha ha ha ha")
class("56.6")
class("TRUE")
```

Can we change the type of data in a variable? Yes, you need to use the function `as.---()`

```{r}
as.numeric(TRUE)
as.character(4)
as.numeric("4.5")
as.numeric("Hello")
```


## Data Structures


### Vector 

When there are more than one number or letter stored. Use the combine function c() for that.

```{r}
sale <- c(1, 2, 3,4, 5, 6, 7, 8, 9, 10) # also sale <- c(1:10)

sale <- c(1:10)

sale * sale
```

*Subsetting a vector*:

```{r}
days <- c("Saturday", "Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

days[2]
days[-2]

days[c(2, 3, 4)]
```


* *Exercise*: Create a vector named `my_vector` with numbers from 0 to 1000 in it and calculate mean, median, sd, min, max, and sum of that vector:

```{r}
my_vector <- (0:1000)

mean(my_vector)
median(my_vector)
min(my_vector)
range(my_vector)
class(my_vector)
sum(my_vector)
sd(my_vector)
```

### List

List allows you to gather a variety of objects under one name (that is, the name of the list) in an ordered way. These objects can be matrices, vectors, data frames, even other list.

```{r}
my_list = list(sale, 1, 3, 4:7, "HELLO", "hello", FALSE)
my_list
```

### Factor
Factors store the vector along with the distinct values of the elements in the vector as labels. The labels are always character irrespective of whether it is numeric or character. For example, variable gender with "male" and "female" entries:

```{r}
gender <- c("male", "male", "male", " female", "female", "female")
gender <- factor(gender)
```

R now treats gender as a nominal (categorical) variable: 1=female, 2=male internally (alphabetically).
```{r}
summary(gender)
```

* *Question*: why when we ran the above function i.e. summary(), it showed three and not two levels of the data? *Hint*: run 'gender'.

```{r}
gender
```

So, be careful of spaces!

* *Exercise*: Create a gender factor with 30 male and 40 females (*Hint*: use the `rep()` function):
```{r}
gender <- c(rep("male",30), rep("female", 40))
gender <- factor(gender)
gender
```

There are two types of categorical variables: nominal and ordinal. How to create ordered factors (when the variable is nominal and values can be ordered)? We should add two additional arguments to the `factor()` function: `ordered = TRUE`, and `levels = c("level1", "level2")`. For example, we have a vector that shows participants' education level.

```{r}
edu<-c(3,2,3,4,1,2,2,3,4)

education<-factor(edu, ordered = TRUE)
levels(education) <- c("Primary school","high school","College","Uni graduated")
education
```

* *Exercise*: We have a factor with `patient` and `control` values. Here, the first level is control and the second level is patient. Change the order of levels, so patient would be the first level:

```{r}
health_status <- factor(c(rep('patient',5),rep('control',5)))
health_status

health_status_reordered <- factor(health_status, levels = c('patient','control'))
health_status_reordered
```

Finally, can you relabel both levels to uppercase characters? (*Hint*: check `?factor`)

```{r}
health_status_relabeled <- factor(health_status, levels = c('patient','control'), labels = c('Patient','Control'))
health_status_relabeled
```


### Matrices
All columns in a matrix must have the same mode(numeric, character, etc.) and the same length. It can be created using a vector input to the matrix function.

```{r}
my_matrix = matrix(c(1,2,3,4,5,6,7,8,9), nrow = 3, ncol = 3)

my_matrix
```

### Data frames 

Data frames can hold numeric, character or logical values. Within a column all elements have the same data type, but different columns can be of different data type. Let's create a dataframe:

```{r}
id <- 1:200
group <- c(rep("Psychotherapy", 100), rep("Medication", 100))
response <- c(rnorm(100, mean = 30, sd = 5),
             rnorm(100, mean = 25, sd = 5))

my_dataframe <-data.frame(Patient = id,
                          Treatment = group,
                          Response = response)
```

We also could have done the below

```{r}
my_dataframe <-data.frame(Patient = c(1:200),
                          Treatment = c(rep("Psychotherapy", 100), rep("Medication", 100)),
                          Response = c(rnorm(100, mean = 30, sd = 5),
                                       rnorm(100, mean = 25, sd = 5)))
```

In large data sets, the function head() enables you to show the first observations of a data frames. Similarly, the function tail() prints out the last observations in your data set.

```{r eval=F}
head(my_dataframe) 
tail(my_dataframe)
```

```{r echo=F}
head(my_dataframe) %>%
  mutate(` `= c(1:6)) %>%
  select(` `, Patient, Treatment,	Response) %>%
  knitr::kable() %>%
  kable_styling(bootstrap_options = c("striped", "bordered", "condensed"), fixed_thead = T, full_width = T)

tail(my_dataframe)%>%
  knitr::kable() %>%
  kable_styling(bootstrap_options = c("striped", "bordered", "condensed"), fixed_thead = T, full_width = T)
```

Similar to vectors and matrices, brackets [] are used to selects data from rows and columns in data.frames:

```{r}
my_dataframe[35, 3]
```

* *Exercise*: How can we get all columns, but only for the first 10 participants?

```{r eval=F}
my_dataframe[1:10, ]
```

```{r echo=F}
knitr::kable(my_dataframe[1:10, ]) %>%
  kable_styling(bootstrap_options = c("striped", "bordered", "condensed"), fixed_thead = T, full_width = T)

```
How to get only the Response column for all participants?

```{r}
my_dataframe[ , 3]
```

Another easier way for selecting particular items is using their names that is more helpful than number of the rows in large data sets:
```{r eval=F}
my_dataframe[ , "Response"]
# OR:
my_dataframe$Response

```

So far, we created dataframes using `data.frame` function from the base R. However, a better way to create dataframes is to use the `tibble` function from tidyverse (see [here](https://r4ds.had.co.nz/tibbles.html)).

# Data Cleaning

```{r echo=FALSE, out.width="700px", out.height="350px", fig.cap= "Artwork by Allison Horst: https://github.com/allisonhorst/stats-illustrations"}
knitr::include_graphics(here('inputs','environmental-data-science-r4ds-general.png'))
```


```{r echo=FALSE, out.width="700px", out.height="350px", fig.cap= "Artwork by Allison Horst: https://github.com/allisonhorst/stats-illustrations"}
knitr::include_graphics(here('inputs','cracked_setwd.png'))
```

Now, suppose we ran an experiment with 141 depressed patients. Participants were randomly assigned into two treatment groups: CBT or Psychodynamic psychotherapy. We measured self-report depression scores at 5 different stages of treatment: 

- Stage 1: Before starting any treatment. It is our base stage (pre-test)
- Stage 2: After 5 sessions of psychotherapy (post-test1)
- Stage 3: After 10 sessions of psychotherapy (post-test2)
- Stage 4: At the end of the treatment (post-test3)
- Stage 5: Three months after the treatment (post-test4)

let's read and check the uncleaned data. But, first thing first. let's install and then load the tidyvese package. We also need some other packages:

```{r message=F, warning=F, eval=F}

# Install it
install.packages("tidyverse")

# And then load it
library(tidyverse)

# Load other packages that you have already installed
library(here)
library(janitor)
library(broom)
library(afex)
library(emmeans)
library(knitr)
library(kableExtra)
library(ggsci)
library(patchwork)
library(skimr)
# install.packages("devtools")
# devtools::install_github("easystats/correlation")
library("correlation")
options(scipen=999) # turn off scientific notations
options(contrasts = c('contr.sum','contr.poly')) # set the contrast sum globally 
options(knitr.kable.NA = '')
```

```{r message=F, warning=F, eval=F}
# read the raw data
raw_data <- read_csv(here("raw_data","raw_data_exp1.csv"))
head(raw_data)
```

```{r message=F, warning=F, echo=F}
# read the raw data
raw_data <- read_csv(here("raw_data","raw_data_exp1.csv"))

knitr::kable(head(raw_data)) %>%
  kable_styling(bootstrap_options = c("striped", "bordered", "condensed"), fixed_thead = T, full_width = F)%>%
  scroll_box(width = "780px")
```

* *Exercise*: There is a dataset in the `cleaned_data` folder named `unicef_u5mr.csv`. Read the dataset using `read_csv` and `here`.
```{r message=F, warning=F}
unicef_data <- read_csv(here("cleaned_data","unicef_u5mr.csv"))
```

```{r echo=FALSE, out.width="700px", out.height="350px", fig.cap= "Artwork by Allison Horst: https://github.com/allisonhorst/stats-illustrations"}
knitr::include_graphics(here('inputs','tidydata_3.jpg'))
```

In order to clean the data, we use *tidyverse* which is a collection of packages to work with data. One of the tidyverse packages that we use regularly is `dplyr` which includes several functions:

- `mutate()` adds new variables or change existing ones.
- `select()` pick variables (columns) based on their names.
- `filter()` picks cases (rows) based on their values.
- `summarise()` gives a single single summary of the data (e.g., mean, counts, etc.)
- `arrange()` changes the ordering of the rows.
- `group_by()` divides your dataframe into grouped dataframes and allow you to do each of the above operations (except for `arrange`) on every one of them separately.

## Select

Pick `subject`, `age`, and `gender` columns:

```{r message=F, warning=F, eval=F}
selected_data <- select(raw_data, subject, age, gender)
```

## Filter
Now, do the following tasks: pick all the male participants, pick all the male participants **or** those greater than 25 years old, and finally pick all male participants **and** those greater than 25 years old:

```{r message=F, warning=F, eval=F}
# filter all males
fil_male <- filter(raw_data, gender == "Male")
# filter males and older than 25
fil_male_and_g25 <- filter(raw_data, gender == "Male" & age > 25 )
# filter males or older than 25
fil_male_or_g25 <- filter(raw_data, gender == "Male" | age > 25 )
```

## Arrange 
Arrange (order) your dataframe based on the age, once in an ascending order (youngers first) and once based on descending order (olders first):

```{r message=F, warning=F}
# order participants based on their age
arranged_data <- arrange(raw_data, age)
# order participants based on their age (descendeing)
arranged_descending <- arrange(raw_data, desc(age))
```

## Mutate
Create a column to show if the participant has finished the task or not:
```{r message=F, warning=F}
mutated_data <- mutate (raw_data, finished= case_when(progress==100~ "Yes",T~ "No"))
```

## Summarise
Summarize participants age and sd:
```{r message=F, warning=F}
summarise(raw_data, mean= mean(age, na.rm=T),
          sd= sd (age, na.rm=T))
```

## Pipe Operators
A new function: **pipe operators** `%>%` pipes a value into the next function:

```{r message=F, warning=F}
raw_data %>% 
  summarise(., mean= mean(age, na.rm=T),
            sd= sd (age, na.rm=T))
```


```{r message=F, warning=F}
raw_data %>% 
  summarise(mean= mean(age, na.rm=T),
            sd= sd (age, na.rm=T))
```

Calculate the age mean of younger than 25 participants only:

```{r message=F, warning=F}
raw_data %>% 
  filter (age < 25) %>%
  summarise(mean= mean(age, na.rm=T),
            sd= sd (age, na.rm=T))
```

## Group By

Calculate the age mean of younger than 25 participants  for each gender separately:

```{r message=F, warning=F}
raw_data %>% 
  filter (age < 25) %>%
  group_by(gender) %>%
  summarise(mean= mean(age, na.rm=T),
            sd= sd (age, na.rm=T)) %>%
  ungroup ()
```         


* *Exercise*: Create a column to show if participant is older than 23 or not and then calculate sleep quality (`sleep_quality`) mean for each group separately:
```{r message=F, warning=F}
raw_data %>%
  mutate(age_group = case_when(age > 23 ~ "greater than 23", T~ "younger than 23")) %>%
  group_by(age_group) %>%
  summarise(sleep_quality = mean(sleep_quality, na.rm=T))
```     

* *Exercise*: Add the anxiety total score (sum) to the dataframe and then convert subject column to factor:
```{r message=F, warning=F}
anxiety_data <- raw_data %>%
  mutate(anxiety_total= anxiety1+anxiety2+anxiety3+anxiety4+anxiety5+anxiety6+anxiety7+anxiety8) %>%
  mutate(subject= factor(subject))
``` 

## Pivoting

Next, we want to pivot our data to switch between long and wide format:

```{r echo=FALSE, out.width="700px", out.height="350px", fig.cap= "Artwork by Allison Horst: https://github.com/allisonhorst/stats-illustrations"}
knitr::include_graphics(here('inputs','tidydata_1.jpg'))
```

```{r message=F, warning=F}

# Make you data long
long_data <- raw_data %>%
  select(subject, stage1_cbt:stage5_cbt,stage1_dynamic:stage5_dynamic) %>%
  pivot_longer(cols = c(stage1_cbt:stage5_dynamic), names_to = 'stage', values_to = 'depression_score')

# Make you data wide
wide_data <- long_data %>%
  pivot_wider(names_from = stage, values_from= depression_score)

```

* *Exercise*: Convert the UNICEF dataset to long and wide formats:
```{r message=F, warning=F}
unicef_data <- read_csv(here("cleaned_data","unicef_u5mr.csv"))

library(janitor)
unicef_data_cleaned <- unicef_data %>%
  clean_names()

unicef_long_data <- unicef_data_cleaned %>% pivot_longer(cols = c(u5mr_1950:u5mr_2015), names_to = 'year', values_to = 'u5mr')
unicef_wideg_data <- unicef_long_data %>% pivot_wider(names_from = 'year', values_from = 'u5mr')
```

*Note*: The codes for the previous exercise were taken from [this blog post](https://sejdemyr.github.io/r-tutorials/basics/wide-and-long/) written by Simon Ejdemyr.

Now, let's do some cleaning using `dplyr`, `tidyr` and other `tidyverse` libraries: 
```{r message=F, warning=F, eval=F}
cleaned_data <- raw_data %>% 
  filter(progress == 100) %>% # filter out unfinished participants
  select(-consent_form) %>% #remove some useless columns
  # create a total score for our questionnaire
  mutate(anxiety_total= anxiety1+anxiety2+anxiety3+anxiety4+anxiety5+anxiety6+anxiety7+anxiety8) %>%
  select(-anxiety1:-anxiety8) %>%
  # make our dataframe long
  pivot_longer(cols = c(stage1_cbt:stage5_cbt,stage1_dynamic:stage5_dynamic),names_to = 'stage',values_to = 'depression_score') %>% 
  #pivot_wider(names_from = stage, values_from= depression_score) # this code change our dataframe back to wide
  filter(!is.na(depression_score)) %>% #remove rows with depression_score == NA
  mutate(stage= gsub("_.*", "", stage)) %>%
  select (subject, age, gender, group, stage, depression_score, anxiety_total, sleep_quality, life_satisfaction)
```


```{r message=F, warning=F, echo=F}
cleaned_data <- raw_data %>% 
  filter(progress == 100) %>% # filter out unfinished participants
  select(-consent_form) %>% #remove some useless columns
  # create a total score for our questionnaire
  mutate(anxiety_total= anxiety1+anxiety2+anxiety3+anxiety4+anxiety5+anxiety6+anxiety7+anxiety8) %>%
  select(-anxiety1:-anxiety8) %>%
  # make our dataframe long
  pivot_longer(cols = c(stage1_cbt:stage5_cbt,stage1_dynamic:stage5_dynamic),names_to = 'stage',values_to = 'depression_score') %>% 
  #pivot_wider(names_from = stage, values_from= depression_score) # this code change our dataframe back to wide
  filter(!is.na(depression_score)) %>% #remove rows with depression_score == NA
  mutate(stage= gsub("_.*", "", stage)) %>%
  select (subject, age, gender, group, stage, depression_score, anxiety_total, sleep_quality, life_satisfaction)

knitr::kable(head(cleaned_data)) %>%
  kable_styling(bootstrap_options = c("striped", "bordered", "condensed"), fixed_thead = T, full_width = F)%>%
  scroll_box(width = "780px")
```

Ok, now the data is clean and tidy which means:

> 1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table ([Wickham](https://vita.had.co.nz/papers/tidy-data.pdf), 2014).


Check the dataframe and all the data types:
```{r}
str(cleaned_data)
```

Finally, we save our data to the `cleaned_data` folder.

```{r}
write_csv(cleaned_data, here("cleaned_data","cleaned_data_exp1.csv"))
```


# Data Visualization

```{r echo=FALSE, out.width="700px", out.height="350px", fig.cap= "Artwork by Allison Horst: https://github.com/allisonhorst/stats-illustrations"}
knitr::include_graphics(here('inputs','ggplot2_masterpiece.png'))
```

Before starting the ggplot, let's try a visualization using a function from the Base R the plot() function shows the association of each variable against the other one in a data handy for data with few number of variables to see if there are any patterns

```{r message=F, warning=F, dpi= 300, fig.height=3, fig.width=4}

exam_data<- read_csv(here::here("cleaned_data", "exam_data.csv"))

plot(x = exam_data$Anxiety, y = exam_data$Exam)

```

The code also works without writing x and y, however, writing them is strongly recommended

```{r message=F, warning=F, dpi= 300, fig.height=2, fig.width=4}

plot(exam_data$Anxiety, exam_data$Exam)
```

`ggplot`, the gg in ggplot stands for grammar of graphics. Grammar of graphics basically says any graphical representation of data, can be produced by a series of layers. You can think of a layer as a plastic transparency. Lets draw the same plot using ggplot. Always, mention the data we are going to work with.
```{r message=F, warning=F, dpi= 300, fig.height=2, fig.width=4}
ggplot(data = exam_data, aes(x = Exam, y = Anxiety))
```


- `aes`: aes which stands for aesthetics is a relationship between a variable in your dataset and an aspect of the plot that is going to visually convey the information to the reader

- Visual elements are known as geoms (short for 'geometric objects') in ggplot 2. When we define a layer, we have to tell R what geom we want displayed on that layer (do we want a bar, line dot, etc.?)

```{r message=F, warning=F, dpi= 300, fig.height=2, fig.width=4}
ggplot(data = exam_data, aes(x = Exam, y = Anxiety))+ geom_point()
```

So, lets try some of them here like shape and size. Be careful with the + sign, if you clink enter for the next part of the code, the + sign should not go to the next line

```{r message=F, warning=F, dpi= 300, fig.height=2, fig.width=4}
ggplot(data = exam_data, aes(x = Exam, y = Anxiety))+
  geom_point(size = 2, shape = 8)
```

The current plot is not very informative about the patterns for each gender.
```{r message=F, warning=F, dpi= 300, fig.height=2, fig.width=4}
ggplot(data = exam_data, aes(x = Exam, y = Anxiety, color = Gender))+
  geom_point(size = 2, shape = 10)

ggplot(data = exam_data, aes(x = Exam, y = Anxiety, color = Gender, shape = Gender))+
  geom_point(size = 2, shape = 10)
```

Question: why the above code doesn't make any change?

```{r message=F, warning=F, dpi= 300, fig.height=2, fig.width=4}
ggplot(data = exam_data, aes(x = Exam, y = Anxiety, color = Gender, shape = Gender))+
  geom_point(size = 2)
```

Can assign the first layer to a variable to reduce the length of codes for next layers.

```{r message=F, warning=F, dpi= 300, fig.height=2, fig.width=4}
My_graph <- ggplot(data = exam_data, aes(x = Exam, y = Anxiety))

My_graph + geom_point()
```

lets add a line to the current graph
```{r message=F, warning=F, dpi= 300, fig.height=2, fig.width=4}
My_graph + geom_point() + geom_smooth()
```

Aesthetics can be set for all layers of the plot (i.e., defined in the plot as a whole) or can be set individually for each geom in a plot.

```{r message=F, warning=F, dpi= 300, fig.height=2, fig.width=4}
My_graph + geom_point(aes(color = Gender)) + geom_smooth()

My_graph + geom_point(aes(color = Gender)) + geom_smooth(aes(color = Gender))
```

The shaded area around the line is the 95% confidence interval around the line. We can switch this off by  adding `se = F` (which is short for 'standard error = False')

```{r message=F, warning=F, dpi= 300, fig.height=2, fig.width=4}
My_graph + geom_point() + geom_smooth(se = F)
```


What if we want our line to be a direct line?
```{r message=F, warning=F, dpi= 300, fig.height=2, fig.width=4}
My_graph + geom_point() + geom_smooth(se = F, method = lm)

```
How to change the labels of x and y axes?
```{r message=F, warning=F, dpi= 300, fig.height=2, fig.width=4}
My_graph + geom_point() + geom_smooth(se = F, method = lm) +
  labs(x = "Exam scores %", y = "Anxiety scores")
```

Histograms are used to show distributions of variables while bar charts are used to compare variables. Histograms plot quantitative data with ranges of the data grouped into bins or intervals while bar charts plot categorical data.

```{r message=F, warning=F, dpi= 300, fig.height=2, fig.width=4}
#ggplot(data = exam_data, aes(x = Anxiety, y = Exam )) + geom_histogram()
# the code above gives an error as geom_histogram can only have x or y axis in its aes()

ggplot(data = exam_data, aes(x = Anxiety)) + geom_histogram()

ggplot(data = exam_data, aes(y = Anxiety)) + geom_histogram()

ggplot(data = exam_data, aes(x = Anxiety)) + geom_histogram(bins = 31)

ggplot(data = exam_data, aes(x = Anxiety)) + geom_histogram(bins = 31, fill = "green")

ggplot(data = exam_data, aes(x = Anxiety)) + geom_histogram(bins = 31, fill = "green", col = "red")
```

Let's stop using the My_graph variable and write the whole code from the start again for a bar chart
```{r message=F, warning=F, dpi= 300, fig.height=2, fig.width=4}
ggplot(data = exam_data, aes(x = Sleep_quality))+
  geom_bar()
```
Because we want to plot a summary of the data (the mean) rather than the raw scores themselves, we have to use a stat.
```{r message=F, warning=F, dpi= 300, fig.height=2, fig.width=4}
ggplot(data = exam_data, aes(x = Sleep_quality, y = Exam, fill = Gender))+
  geom_bar(stat = "summary", fun = "mean")


ggplot(data = exam_data, aes(x = Sleep_quality, y = Exam, fill = Gender))+
  geom_bar(stat = "summary", fun = "mean", position = "dodge")
```

The other way to get the same plot that the code above gives, is using the stat_summary function that takes the following general form: `stat_summary(function = x, geom = y)`

```{r message=F, warning=F, dpi= 300, fig.height=2, fig.width=4}
ggplot(data = exam_data, aes(x = Sleep_quality, y = Exam, fill = Gender))+
  stat_summary(fun = mean, geom = "bar", position = "dodge")
```

How to combine multiple plots? How to combine multiple plots? We can use the `patchwork` package. A nice tutorial on using this package can be found [here](https://patchwork.data-imaginist.com/articles/patchwork.html)


```{r echo=FALSE, out.width="700px", out.height="350px", fig.cap= "Artwork by Allison Horst: https://github.com/allisonhorst/stats-illustrations"}
knitr::include_graphics(here('inputs','patchwork_1.jpg'))
```

```{r message=F, warning=F, dpi= 300, fig.height=5, fig.width=6}
p1 = My_graph + geom_point(aes(color = Gender)) + geom_smooth()

p2 = ggplot(data = exam_data, aes(x = Anxiety)) + geom_histogram(bins = 31)

p3 = ggplot(data = exam_data, aes(x = Sleep_quality, y = Exam, fill = Gender))+
  stat_summary(fun = mean, geom = "bar", position = "dodge")

p4 = My_graph + geom_point() + geom_smooth(se = F, method = lm) +
  labs(x = "Exam scores %", y = "Anxiety scores")

combined = p1 + p2+ p3 + p4 + plot_layout(nrow = 4, byrow = F)

combined

p1 | p2 / p3 / p4

p1 | p2 / (p3 / p4)
```


`ggsave()` function, which is a versatile exporting function that can export as PostScript (.eps/.ps), tex (pictex), pdf, jpeg, tiff, png, bmp, svg and wmf (in Windows only). In its basic form, the structure of the function is very simple: `ggsave(filename)`

```{r message=F, warning=F, dpi= 300, fig.height=3, fig.width=5}
ggsave(combined, filename = here("outputs", "combined.png"), dpi=300)
```


Now that we learned the basics of ggplot, let's draw some plot for our experiment data. First, we need to create a dataset with aggregated `depression_score` scores over `group` and `stage`. We will use this dataset for line and bar graphs.

```{r message=F, warning=F, dpi= 300, fig.height=3, fig.width=5}
library(ggsci)

data_exp1_orig <- read_csv(here("cleaned_data","cleaned_data_exp1.csv"))

data_exp1 <- data_exp1_orig%>% 
  #mutate_if(is.character, factor) %>%
  mutate(subject= factor(subject), # convert all characters to factor
         group = factor(group),
         stage = factor(stage))


aggregated_data_exp1 <- data_exp1 %>%
  group_by(stage, group) %>%
  mutate(depression_score = mean(depression_score)) %>%
  ungroup()


barplot_exp1 <- aggregated_data_exp1 %>%
  ggplot(aes(x=stage, y= depression_score, fill=group)) +
  geom_bar(stat = "identity", position= "dodge")+
  labs (x= '', y= "Depression Score") + 
  theme_bw() + 
  scale_fill_jama() 

#ggsave(barplot_exp1, filename = here("outputs","barplot_exp1.png"), dpi=300)


barplot_facet_exp1 <- aggregated_data_exp1 %>%
  ggplot(aes(x=group, y= depression_score, fill=stage)) +
  geom_bar(stat = "identity", position= "dodge")+
  labs (x= '', y= "Depression Score") + 
  theme_bw() + 
  theme(legend.position = "none",
        axis.text=element_text(size=11),
        axis.title = element_text(size = 12)) +
  facet_wrap(~stage, nrow = 1)+
  scale_fill_jco() 

#ggsave(barplot_facet_exp1, filename = here("outputs","barplot_facet_exp1.png"), dpi=300)


lineplot_exp1 <- aggregated_data_exp1 %>%
  ggplot(aes(x=factor(stage), y= depression_score, group= group, color= group)) +
  geom_line(aes(linetype= group)) +
  geom_point(size= 5)+
  labs (x= '', y= "Depression Score") + 
  theme_classic() +
  theme(legend.position = "bottom",
        axis.text=element_text(size=11),
        axis.title = element_text(size = 12)) +
  scale_color_nejm() 

#ggsave(lineplot_exp1, filename = here("outputs","lineplot_exp1.png"), dpi=300)


violinplot_exp1 <- data_exp1 %>%
  ggplot(aes(x=factor(stage), y= depression_score, fill= group)) +
  geom_violin()+
  labs (x= '', y= "Depression Score") + 
  theme_bw() + 
  theme(legend.position = "bottom",
        axis.text=element_text(size=11)) +
  scale_fill_d3() 

#ggsave(violinplot_exp1, filename = here("outputs","violinplot_exp1.png"), dpi=300)


boxplot_exp1 <- data_exp1 %>%
  ggplot(aes(x=factor(stage), y= depression_score, fill= group)) +
  geom_boxplot()+
  #geom_point(position = position_dodge(width=0.75), alpha= .5)+
  labs (x= '', y= "Depression Score") + 
  theme_bw() + 
  theme(legend.position = "bottom",
        axis.text=element_text(size=11)) +
  scale_fill_simpsons() 

#ggsave(boxplot_exp1, filename = here("outputs","boxplot_exp1.png"), dpi=300)


boxplot_facet_exp1 <- data_exp1 %>%
  ggplot(aes(x=factor(stage), y= depression_score, fill= group)) +
  geom_boxplot()+
  labs (x= '', y= "Depression Score") + 
  theme_bw() + 
  theme(legend.position = "bottom",
        axis.text=element_text(size=11),
        axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  facet_wrap(~group)+
  scale_color_simpsons() 

#ggsave(boxplot_facet_exp1, filename = here("outputs","boxplot_facet_exp1.png"), dpi=300)

```

Let's combine our plots:

```{r dpi= 300, fig.height=7, fig.width=9}

combined_plot_exp1 <- barplot_facet_exp1 / (lineplot_exp1+violinplot_exp1+boxplot_exp1)
combined_plot_exp1
```

And here, we save our plots to the `outputs` folder.
```{rmessage=F}
ggsave(combined_plot_exp1, filename = here("outputs","combined_plot_exp1.png"), dpi=300, width = 12)
```

- [A complete guide to ggplot](https://www.cedricscherer.com/2019/08/05/a-ggplot2-tutorial-for-beautiful-plotting-in-r/)

# Descriptive Statistics

```{r echo=FALSE, out.width="700px", out.height="350px", fig.cap= "Artwork by Allison Horst: https://github.com/allisonhorst/stats-illustrations"}
knitr::include_graphics(here('inputs','not_normal.png'))
```

Now, let's do some descriptive statistics. Now, we can open a new script called `data_analysis.r` and read some datasets. Then we use `skimr` package to describe our data.

```{r message=F, warning=F,}
narcissism_data <- read_csv(here("cleaned_data","narcissism_data.csv"))
narcissism_data %>% skimr::skim()
```

* *Exercise*: Open the dataset called `treatment_data.csv` and do a descriptive analysis:
```{r message=F, warning=F}
treatment_data <- read_csv(here("cleaned_data","treatment_data.csv"))
treatment_data %>% skimr::skim()
```

* *Exercise*: Do the same thing for the `memory_data.csv`.

```{r message=F, warning=F}
memory_data <- read_csv(here("cleaned_data","memory_data.csv"))
memory_data %>% group_by(time) %>%
  skimr::skim()
```

Now, let's describe our experiment data:

```{r message=F, warning=F,}
data_exp1_orig <- read_csv(here("cleaned_data","cleaned_data_exp1.csv"))
```

How many participants in total?

```{r message=F, warning=F}
data_exp1 %>% summarise(n= n_distinct(subject))
```


```{r message=F, warning=F, echo=F, eval=F}
data_exp1 %>% summarise(n= n_distinct(subject))%>%
  knitr::kable() %>%
  kable_styling(bootstrap_options = c("striped", "bordered", "condensed"), full_width = F)
```

How many participants do we have in each group?
```{r message=F, warning=F, eval=F}
data_exp1 %>% 
  group_by(subject) %>% 
  filter(row_number()==1) %>% 
  ungroup () %>% 
  group_by(group) %>% 
  count() 
```

```{r message=F, warning=F, echo=F}
data_exp1 %>% group_by(subject) %>% filter(row_number()==1) %>% ungroup () %>% group_by(group) %>% count() %>%
  knitr::kable() %>%
  kable_styling(bootstrap_options = c("striped", "bordered", "condensed"), fixed_thead = T)
```

Find the mean and sd for numeric variables using base R `summary` function:

```{r}
data_exp1 %>% 
  group_by(subject) %>% 
  filter(row_number()==1) %>% 
  ungroup () %>%
  summary()
```

Alternatively, we can use `skimr` library:
```{r eval=F}
data_exp1 %>% 
  group_by(subject) %>% 
  filter(row_number()==1) %>% 
  ungroup () %>% 
  dplyr::select (age, depression_score, anxiety_total, sleep_quality, life_satisfaction) %>% 
  skimr::skim()
```

```{r echo=F}
data_exp1 %>% 
  group_by(subject) %>% 
  filter(row_number()==1) %>% 
  ungroup () %>% 
  dplyr::select (age, depression_score, anxiety_total, sleep_quality, life_satisfaction) %>% 
  skimr::skim() %>%
  knitr::kable() %>%
  kable_styling(bootstrap_options = c("striped", "bordered", "condensed"), fixed_thead = T, full_width = F)%>%
  scroll_box(width = "780px")
```


* *Exercise*: For this exercise, we use a dataset of one of my own studies. In this study, we asked participants to guess the physical brightness of reasoning arguments and then we gave a cognitive ability test. (See the original study [here](https://osf.io/ebxnf/)). Open `ghasemi_brightness_exp4.csv` file and answer to the following questions:

1. How many participants did we test in total?
2. Find out how many male and female we tested.
3. Calculate mean and sd for age and cognitive ability (`cog_ability`).


```{r warning=F, message=F}
ghasemi_data <- read_csv(here("cleaned_data","ghasemi_brightness_exp4.csv"))

ghasemi_data %>% summarise(n = n_distinct(participant)) # number of participants:200

ghasemi_data %>% group_by (participant) %>% filter (row_number()==1) %>% group_by (gender) %>% summarise(n= n()) %>% ungroup() # 183 female, 17 male

ghasemi_data %>% dplyr::select (age, cog_ability) %>% skimr::skim() # mean and sd for age and cognitive ability
```



# Data Analysis

## t-test

Now, we use the treatment data to run three different independent t-tests. Suppose we did an experiment to compare the effectiveness of CBT vs. Psychodynamic therapies in decreasing anxiety, and depression and also in improving life satisfaction:

```{r}
# t.test (indep)
t.test(anxiety~treatment, data= treatment_data)
t.test(depression~treatment, data= treatment_data)
t.test(life_satisfaction~treatment, data= treatment_data)
```

In another experiment, suppose we have created a method to boost memory. Then, we recruit some participants, do a memory pre-test, implement the method, and do a memory post-test, Now, we want to see whether our method have improved participants' memory: 

```{r}
# t.test (paired)
t.test(memory_score~time, data= memory_data, paired= T)
```

Now that we learned about t-test, let's perform this test on our dataset. Is there a difference between groups at the first stage? Ideally, we want participants' depresion score at the first stage to be similar for both groups because we have not started our treatment yet. Previous graphs showed us that depression scores of the CBT and Psychodynamic groups at this stage are pretty close. Let's test that using an **independent t-test** (because we have 2 independent groups):

```{r}
# Is there a difference between groups at the first stage?
data_exp1 %>% 
  group_by(group) %>% 
  filter(stage=='stage1') %>% 
  ungroup () %>%
  t.test(depression_score~group, data = ., paired=FALSE)
```

Now, we wonder if psychotherapy treatments were effective at all, regardless of the treatment method. So, we would like to test if depresion score at the forth stage are lower than scores at the stage 2? Since a pair of score at stage 2 and stage 4 is coming from a same person, we use **paired t-test**.

```{r}
# Is there a difference between ratings of stage2 and stage4?
data_exp1 %>% 
  filter(stage=='stage2' | stage=='stage4') %>% 
  ungroup () %>%
  t.test(depression_score~stage, data = ., paired=TRUE)
```


* *Exercise*: John et al. (2019) investigated the consequences of backing down (changing one's mind in lights of evidence)and how other people view someone who change their mind. In their second experiments, they presented participants either with a person who changes their mind or a person who refuses to back down. Then, they asked participants to rate how intelligent and confident the person is (See the original study [here](https://www.hbs.edu/faculty/Publication%20Files/John%20et%20al%20-%20self-presentational%20consequences_b85b2c43-a5b5-474c-9e2c-e9853b10727e.pdf)). They reported that: 

> "Relative to the entrepreneur who did not back down, participants judged the entrepreneur who backed down as more intelligent (M_backed_down=5.13 out of 7, SD=1.09; M_did_not_back_down=3.97, SD=1.54; t(271.12)=−7.59, p < .001) but less confident (M_backed_down=4.50 out of 7, SD=1.36; M_did_not_back_down=5.65, SD=1.10; t(291.01)=8.08, p < .001).".

Open the `john_backdown_exp2.csv` file and try to reproduce their results. Run two separate independent t-test, one with `intelligent` as the dependent variable and one with `confident` as the dependent variable. For both t-test, use `back_down` as the between-subject independent variable.

```{r message=F, warning=F}
john_data <- read_csv(here("cleaned_data","john_backdown_exp2.csv"))


t.test(intelligent~back_down, data = john_data, paired=FALSE)
t.test(confident~back_down, data = john_data, paired=FALSE)
```


## Analysis of Variance (ANOVA)

Now, let's analysis our main experiment data: Do participants in the CBT group show better outcome compared to participants in the Psychodynamic group? Suppose we believe that participants should show lower depression after 5 or 10 sessions of both psychotherapy treatments and this decrease should be more pronounced for CBT than psychodynamic psychotherapy. If this is the case. we expect an interaction in the traditional **Analysis of Variance (AONVA)** test.

```{r message=F, warning=F}
aov_m1 <- aov_car (depression_score ~ group*stage +
                     Error(subject/stage), data = data_exp1)  
```

```{r echo=F}
knitr::kable(nice(aov_m1)) %>%
  kable_styling(bootstrap_options = c("striped", "bordered", "condensed"), fixed_thead = T, full_width = T)
```

As you can see, we found a significant main effect of stage and a significant group by stage interaction. We can use the `emmeans` package to do post-hoc tests.

```{r warning=F, message=F}
# main effect of stage
emmeans(aov_m1, 'stage')
pairs(emmeans(aov_m1, 'stage'), adjust= 'holm')
```


```{r warning=F, message=F}
# group by stage interaction
emmeans(aov_m1, "group", by= "stage")
update(pairs(emmeans(aov_m1, "group", by= "stage")), by = NULL, adjust = "holm") 
```

You can use the `afex_plot` function from afex to create beautiful plots. Those plots interacts nicely with ggplot:
```{r message=F, warning=F, dpi= 300}
afex_plot(aov_m1, x = "stage", trace = "group", error='between',
          line_arg = list(size=1),
          point_arg = list(size=3.5),
          data_arg = list(size= 1, color= 'grey', width=.4),
          data_geom = geom_boxplot,
          mapping = c("linetype", "shape", "fill"),
          legend_title = "Group") +
  labs(y = "Depression Score", x = "") +
  theme_bw()+ # remove the grey background and grid
  theme(axis.text=element_text(size=13),
        axis.title = element_text(size = 13),
        legend.text=element_text(size=13),
        legend.title=element_text(size=13),
        legend.position='bottom',
        legend.key.size = unit(1, "cm"),
        legend.background = element_rect(colour = 'black', fill = 'white', linetype='solid'))+
  scale_color_simpsons() +
  scale_fill_simpsons()
```


If you are interested in this topic, check out this nice tutorial about [using afex to run ANOVA](https://cran.r-project.org/web/packages/afex/vignettes/afex_anova_example.html), and also this interesting tutorial on the [emmeans package](https://aosmith.rbind.io/2019/03/25/getting-started-with-emmeans/).

* *Exercise*: Rotello et al. (2018) investigated the association between the race (White vs. Black faces) and the gun-tool judgments. In their first experiments, they presented participants with 16 White male faces and 16 Black male faces, and following that 8 images of guns and 8 images of tools. They asked participants to judge if the object is a tool or a gun by pressing keyboard buttons. Then, they ran an ANOVA to see if participants' gun responses are higher for any of the races. So, they included prime race (Black, White) and target identity (gun, tool) as independent variables and participants' gun responses as dependent variable into their linear model (See the original study [here](https://online.ucpress.edu/collabra/article/4/1/32/112986/The-Shape-of-ROC-Curves-in-Shooter-Tasks)). They found that: 

> "Participants made more gun responses to guns than to tools, F(1,45) = 53243, p < 0.0001, η2g = 0.998. However, the race of the prime face did not matter, F(1,45) = 0.287, p > 0.59, η2g = 0.001, nor was there an interaction of prime race with target object, F(1,45) = 0.022, p > 0.88, η2g = 0.000)".

Open the `rotello_shooter_exp1.csv` file and try to reproduce their results. Run an ANOVA (type III) with `resp` as the dependent variable and target, prime, and their interaction as independent variables.


```{r message=F, warning=F}
# load the general data file
rotello_data <- read_csv(here("cleaned_data","rotello_shooter_exp1.csv"))

# ANOVA
rotello_aov <- aov_car (resp ~ target*prime +
           Error(subject/target*prime), data = rotello_data)
```

```{r echo=F}
knitr::kable(nice(rotello_aov)) %>%
  kable_styling(bootstrap_options = c("striped", "bordered", "condensed"), fixed_thead = T, full_width = T)
```



## Correlation

Here, we want to check the correlation between variables on the `narcissism_data`. First, we need to remove `subject` column because it is not numeric:
```{r message=F, warning=F}
narcissism_data_cor <- narcissism_data %>%
  select(-subject)
```

```{r message=F, eval=F, fig.align='center', dpi=300}

#-- Base R:
cor(narcissism_data_cor, method = "pearson",  use = "complete.obs")

#-- Psych library:
psych::pairs.panels(narcissism_data_cor, method = "pearson", hist.col = "#00AFBB", density = T, ellipses = F, stars = T)

#-- Correlation library:
# install.packages("devtools")
# devtools::install_github("easystats/correlation")
#library("correlation")
correlation::correlation(narcissism_data_cor) %>% summary()

#-- apaTables library:
narcissism_data_cor %>% 
  apaTables::apa.cor.table(filename="./outputs/CorMatrix.doc", show.conf.interval=T)
```

```{r message=F, echo=F, fig.align='center', dpi=300}

#-- Base R:
cor(narcissism_data_cor, method = "pearson",  use = "complete.obs")%>%
  knitr::kable(digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "bordered", "condensed"), fixed_thead = T, full_width = F)

#-- Psych library:
psych::pairs.panels(narcissism_data_cor, method = "pearson", hist.col = "#00AFBB", density = T, ellipses = F, stars = T)

#-- Correlation library:
# install.packages("devtools")
# devtools::install_github("easystats/correlation")
#library("correlation")
correlation::correlation(narcissism_data_cor) %>% summary()%>%
  knitr::kable(digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "bordered", "condensed"), fixed_thead = T, full_width = F)

```


* *Exercise*: Pennycook et al. (2020) investigated the relationship between actively open-minded thinking style about evidence (AOT-E) and different political, scientific, and religious beliefs (see the original paper [here](https://psyarxiv.com/a7k96)). In their first experiment, they calculated the correlation of AOTE and scientific beliefs items (global warming, evolution, etc.) and they found the following results:

```{r echo=FALSE, out.width="700px", out.height="350px", fig.cap= "adapted from [Pennycook et al. (2020)](https://psyarxiv.com/a7k96)"}
knitr::include_graphics(here('inputs','pennycook_corr.png'))
```

Open the `pennycook_aote_exp1.csv` file and try to reproduce their results by creating the same correlation matrix.

```{r message=F, eval=F}
pennycook_data <- read_csv(here("cleaned_data","pennycook_aote_exp1.csv")) 


#---------- Base R:
cor(pennycook_data, method = "pearson",  use = "complete.obs")

#---------- Psych library:
pennycook_data %>% 
  psych::pairs.panels(method = "pearson", hist.col = "#00AFBB", density = T, ellipses = F, stars = T)

#---------- Correlation library:
correlation::correlation(pennycook_data) %>% summary()

#---------- apaTables library:
pennycook_data %>% 
  apaTables::apa.cor.table(filename="./outputs/CorMatrix.doc", show.conf.interval=T)
```


```{r message=F, eval=T, echo=F, fig.align='center', dpi=300}
pennycook_data <- read_csv(here("cleaned_data","pennycook_aote_exp1.csv")) %>%
  clean_names()

correlation::correlation(pennycook_data) %>% summary() %>%
  knitr::kable(digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "bordered", "condensed"), fixed_thead = T, full_width = F)%>%
  scroll_box(width = "780px")

```


## Linear Regression

Here, we do single and multiple linear regreassion on the `narcissism_data`:

```{r}
m1 <- lm(mental_health~narcissism, data= narcissism_data)
```

```{r message=F, eval=T, echo=F, fig.align='center', dpi=300}
broom::tidy(m1)%>%
  knitr::kable(digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "bordered", "condensed"), fixed_thead = T, full_width = T)
```

```{r}
m2 <- lm(mental_health~narcissism+psychopathy, data= narcissism_data)
```

```{r message=F, eval=T, echo=F, fig.align='center', dpi=300}
broom::tidy(m2)%>%
  knitr::kable(digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "bordered", "condensed"), fixed_thead = T, full_width = T)
```

* *Exercise*: Trémolière and Djeriouat (2020) examined the role of *cognitive reflection* and *belief in science* in climate change skepticism. In their first study, they revealed that cognitive reflection and belief in science negetively predicted climate change skepticism even after controlling for demographic and cognitive ability variables (see the original paper [here](https://psyarxiv.com/vp8k6/)). 

```{r echo=FALSE, out.width="700px", out.height="350px", fig.cap= "adapted from [Trémolière and Djeriouat (2020)](https://psyarxiv.com/vp8k6/)"}
knitr::include_graphics(here('inputs','tremoliere_reg.png'))
```

Open the `tremoliere_data_exp1.csv` file and try to reproduce their results by running a multiple linear regression. Enter age, gender, education, belief in science, literacy, numeracy (Numtotal), and cognitive reflection as predictors and enter climate change skepticism (climato) as the outcome variable.

```{r message=F}
Tremoliere_data <- read_csv(here("cleaned_data","tremoliere_data_exp1.csv"))

Tremoliere_reg=lm(Climato ~ Age+ Gender+ Education+ BeliefInSciencetotal+ Literacy+ Numtotal+ CognitiveReflection,
                    data=Tremoliere_data)
```


```{r message=F, eval=T, echo=F, fig.align='center', dpi=300}
broom::tidy(Tremoliere_reg)%>%
  knitr::kable(digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "bordered", "condensed"), fixed_thead = T, full_width = T)

glance(Tremoliere_reg)%>%
  knitr::kable(digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "bordered", "condensed"), fixed_thead = T, full_width = F)%>%
  scroll_box(width = "780px")
```


# Rmarkdown

To be completed...


```{r echo=FALSE, out.width="700px", out.height="350px", fig.cap= "Artwork by Allison Horst: https://github.com/allisonhorst/stats-illustrations"}
knitr::include_graphics(here('inputs','rmarkdown_wizards.png'))
```


```{r echo=FALSE, out.width="700px", out.height="350px", fig.cap= "Artwork by Allison Horst: https://github.com/allisonhorst/stats-illustrations"}
knitr::include_graphics(here('inputs','reproducibility_court.png'))
```

# References

- Ghasemi, O., Handley, S., & Howarth, S. (2020). The Bright Homunculus in our Head: Individual Differences in Intuitive Sensitivity to Logical Validity.

- John, L. K., Jeong, M., Gino, F., & Huang, L. (2019). The self-presentational consequences of upholding one’s stance in spite of the evidence. Organizational Behavior and Human Decision Processes, 154, 1-14.

- Pennycook, G., Cheyne, J. A., Koehler, D. J., & Fugelsang, J. A. (2020). On the belief that beliefs should change according to evidence: Implications for conspiratorial, moral, paranormal, political, religious, and science beliefs. Judgment and Decision Making, 15(4), 476.

- Rotello, C. M., Kelly, L. J., Heit, E., Vazire, S., & Vul, E. (2018). The Shape of ROC Curves in Shooter Tasks: Implications for Best Practices in Analysis. Collabra: Psychology, 4(1).

- Trémolière, B., & Djeriouat, H. (2020). Don’t you see that its cold! Exploring the roles of cognitive reflection, climate science literacy, illusion of knowledge, and political orientation in climate change skepticism.

- Wickham, H. (2014). Tidy data. Journal of Statistical Software, 59(10), 1-23.